Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes

arXiv:2210.11604v3 [cs.LG] 21 May 2023
Runlong Zhou¹, Ruosong Wang¹, Simon S. Du¹
Abstract
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight. We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver. We prove an $\widetilde{O}(\sqrt{\mathrm{Var} \cdot M \Gamma S A K})$ regret bound, where $\widetilde{O}$ hides logarithmic factors, $M$ is the number of contexts, $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, $\Gamma \le S$ is the maximum transition degree of any state-action pair, and $\mathrm{Var}$ is a variance quantity describing the determinism of the LMDP. The regret bound scales only logarithmically with the planning horizon, thus yielding the first (nearly) horizon-free regret bound for LMDPs. This is also the first problem-dependent regret bound for LMDPs. Key to our proof is an analysis of the total variance of alpha vectors (a generalization of value functions), which is handled with a truncation method. We complement our positive result with a novel $\Omega(\sqrt{\mathrm{Var} \cdot M S A K})$ regret lower bound with $\Gamma = 2$, which shows that our upper bound is minimax optimal when $\Gamma$ is a constant, for the class of variance-bounded LMDPs. Our lower bound relies on new constructions of hard instances and an argument inspired by the symmetrization technique from theoretical computer science, both of which are technically different from existing lower bound proofs for MDPs, and thus may be of independent interest.
¹Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA. Correspondence to: Simon S. Du <ssdu@cs.washington.edu>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
1. Introduction
One of the most popular models for reinforcement learning (RL) is the Markov Decision Process (MDP), in which the transitions and rewards depend only on the current state and the agent's action. In standard MDPs, the agent fully observes the state, so the optimal policy also depends only on states (called a history-independent policy). There is a long line of research on MDPs, and minimax regret and sample complexity guarantees have been derived.
Another popular model is the Partially Observable MDP (POMDP), in which the agent only has partial observations of the states. Even though the underlying transition is still Markovian, the sample complexity lower bound has been proven to be exponential in the state and action sizes. This is in part because the optimal policies for POMDPs are history-dependent.
In this paper we focus on a middle ground between MDPs and POMDPs, namely the Latent MDP (LMDP). An LMDP can be viewed as a collection of MDPs sharing the same state and action spaces, but whose transitions and rewards may vary across them. One MDP is sampled with some probability at the beginning of each episode and does not change during the episode. The agent needs to find a policy which works well on these MDPs in an average sense. Empirically, LMDPs can be used for a wide variety of applications (Yu et al., 2020; Iakovleva et al., 2020; Finn et al., 2018; Ramamoorthy et al., 2013; Doshi-Velez & Konidaris, 2016; Yao et al., 2018). In general, there exists no policy that is optimal on every single MDP simultaneously, so this task is harder than learning a single MDP. On the other hand, an LMDP is a special case of a POMDP because the unobserved component (the identity of the MDP) is static within each episode and the observable state is just the state of the sampled MDP.
Unfortunately, for generic LMDPs, there exists an exponential sample complexity lower bound (Kwon et al., 2021), so additional assumptions are needed to make the problem tractable. In this paper, we consider the setting in which, after each episode ends, the agent receives the context, i.e., which MDP it interacted with. This is called context in hindsight.
Such information is often available. For example, in a maze
navigation task, the location of the goal state can be viewed
as the context.
In this setting, Kwon et al. (2021) obtained an $\widetilde{O}(\sqrt{M S^2 A H K})$ regret upper bound¹, where $H$ is the planning horizon. They did not study the regret lower bound. To benchmark this result, the only available bound is the $\Omega(\sqrt{SAK})$ lower bound from standard MDPs, obtained by viewing an MDP as a special case of an LMDP.

¹Their original bound is $\widetilde{O}(\sqrt{M S^2 A H^3 K})$ with the scaling that the reward from each step is bounded by 1. We rescale the reward to be bounded by $1/H$ in order to make the total reward from each episode bounded by 1, which is the setting we consider.
When we view an MDP as a special case of an LMDP, there are problem-dependent results (Zanette & Brunskill, 2019; Bartlett & Tewari, 2012; Fruit et al., 2018; Maillard et al., 2014; Jin et al., 2020; Wagenmaker et al., 2022) which draw our attention. RL algorithms often perform far better on MDPs with special structures than their worst-case guarantees would suggest. Algorithms with a problem-dependent regret guarantee should automatically adapt to the MDP instance without prior knowledge of problem-dependent quantities. Zanette & Brunskill (2019) provide an algorithm for MDPs whose regret scales with the maximum per-step conditional variance of the MDP instance. This quantity, $Q^\star$, is determined by the value function of the optimal policy. Their regret bound $\widetilde{O}(\sqrt{H Q^\star \cdot SAK} + H^{5/2} S^2 A)$ reduces to a constant $\widetilde{O}(H^{5/2} S^2 A)$ when the MDP is deterministic ($Q^\star = 0$). This work motivates us to further study how to obtain a variance-dependent regret.
Comparing these bounds, we find significant gaps. Is the dependency on $M$ in LMDPs necessary? The bound for MDPs is (nearly) horizon-free (no dependency on $H$); is the polynomial dependency on $H$ in LMDPs necessary? If the LMDP reduces to a deterministic MDP, can the algorithm automatically achieve a constant regret (up to logarithmic factors)? Further, is it possible to have variance-dependency in the regret bound for LMDPs? Also, the dependency on the number of states is $\sqrt{S}$ for MDPs, but the bound in (Kwon et al., 2021) for LMDPs scales with $S$.

In this paper, we resolve the first three questions and partially answer the fourth.
1.1. Main contributions and technical novelties
We obtain the following new results:
A suitable notion of variance for LMDPs. There is no easy analogue of $Q^\star$ under the LMDP setting, because different optimal policies may behave differently on a given MDP. The optimal alpha vector (the counterpart of the value function for LMDPs) is not unique, so it is hard to define a variance quantity depending only on the optimal alpha vector. We define a suitable notion of variance, $\mathrm{Var}$, which is the maximum variance of the total reward induced by any deterministic policy (see Section 4). By definition, $\mathrm{Var} \le O(1)$ because the total reward is bounded by 1, so a regret bound that depends on $\sqrt{\mathrm{Var}}$ automatically recovers the worst-case regret bound.
Near-optimal regret guarantee for LMDPs. We present an algorithmic framework for LMDPs with context in hindsight. This framework can be instantiated with a plug-in solver for planning problems. We consider two types of solvers, one model-optimistic and one value-optimistic, and prove their regret to be $\widetilde{O}(\sqrt{\mathrm{Var} \cdot M \Gamma S A K} + M S^2 A)$, where $\Gamma \le S$ is the maximum transition degree of any state-action pair. Compared with the result in (Kwon et al., 2021), ours only requires the total reward to be bounded, whereas they required a bounded reward at each step. We improve the $H$-dependency from $\sqrt{H}$ to logarithmic, making our bound (nearly) horizon-free. Furthermore, our bound scales with $\sqrt{S\Gamma}$, which is strictly better than the $S$ in their bound. Finally, our main order term is variance-dependent, which means that when the LMDP reduces to a deterministic MDP ($\mathrm{Var} = 0$), our regret is constant up to logarithmic factors.
The main technique of our model-optimistic algorithm is to use a Bernstein-type confidence set on each entry of the transition dynamics, leading to a small Bellman error. The main difference between our value-optimistic algorithm and that of Kwon et al. (2021) is that we use a bonus depending on the variance of next-step values according to Bennett's inequality, instead of using Bernstein's inequality. This helps propagate optimism from the last step to the first step, avoiding the $\sqrt{H}$-dependency. We analyze these two solvers in a unified way, as their Bellman errors are of the same order. We derive the variance-dependent regret by upper-bounding the total variance of estimated alpha vectors using martingale concentration inequalities together with a truncation method.
New regret lower bound for LMDPs. We obtain a novel $\Omega(\sqrt{V \cdot M S A K})$ minimax regret lower bound for the class of LMDPs with $\mathrm{Var} \le O(V)$. This regret lower bound shows that the dependency on $M$ is necessary for LMDPs. Notably, the lower bound also implies that the $\widetilde{O}(\sqrt{\mathrm{Var} \cdot M \Gamma S A K})$ upper bound is optimal up to a $\sqrt{\Gamma}$ factor. Furthermore, our lower bound holds even for $\Gamma = 2$, which shows that our upper bound is minimax optimal for the class of LMDPs with $\Gamma = O(1)$. Maze navigation problems are a typical class with $\Gamma = O(1)$: the agent has at most 4 next states to transition into (up, down, left, right).
Our proof relies on new constructions of hard instances,
different from existing ones for MDPs (Domingues et al., 2021). In particular, we use a two-phase structure to construct hard instances (cf. Figure 1). Furthermore, the previous approaches for proving lower bounds for MDPs do not work on LMDPs. For example, in the MDP instance of Domingues et al. (2021), the randomness comes from the algorithm and the last transition step before entering the good state or bad state. In an LMDP, the randomness of sampling the MDP from multiple MDPs must also be considered. Such randomness not only dilutes the value function by averaging over each MDP, but also divides the pushforward measure (see Page 3 of Domingues et al. (2021)) into $M$ parts. As a result, the $M$ terms in the KL divergence in Equation (2) of Domingues et al. (2021) and those in Equation (10) cancel out, so the final lower bound does not contain $M$. To overcome this, we adopt the core spirit of the symmetrization technique from theoretical computer science. We randomly give a single MDP of an $M$-MDP LMDP to the agent, while making it unable to distinguish which position this MDP occupies in the whole system. The idea of reducing $M$ components to one component shares the same spirit as symmetrization (Section 1.1.1 of Phillips et al. (2012)). This novel technique helps generalize the bound from a single-party result to a multi-party result, which gives rise to a tighter lower bound.
2. Related Works
LMDPs. As shown by Steimle et al. (2021), in the general case, optimal policies for LMDPs are history-dependent and PSPACE-hard to find. This is different from standard MDPs, where there always exists an optimal history-independent policy. However, even finding the optimal history-independent policy is NP-hard (Littman, 1994). Chades et al. (2012) provided heuristics for finding the optimal history-independent policy.
Kwon et al. (2021) investigated the sample complexity and regret bounds of LMDPs. Specifically, they presented an exponential lower bound for general LMDPs without context in hindsight, and then derived algorithms with polynomial sample complexity and sub-linear regret for two special cases (context in hindsight, or $\delta$-strongly separated MDPs).
LMDPs have been studied as a type of multi-task RL (Taylor & Stone, 2009; Brunskill & Li, 2013; Liu et al., 2016; Hallak et al., 2015). They have been applied to model combinatorial optimization problems (Zhou et al., 2022). There are also related studies on model transfer (Lazaric, 2012; Zhang & Wang, 2021) and contextual decision processes (Jiang et al., 2017). In empirical works, LMDPs have wide applications in multi-task RL (Yu et al., 2020), meta RL (Iakovleva et al., 2020; Finn et al., 2018), latent-variable MDPs (Ramamoorthy et al., 2013) and hidden-parameter MDPs (Doshi-Velez & Konidaris, 2016; Yao et al., 2018).
Regret Analysis for MDPs. LMDPs are generalizations of MDPs, so some previous approaches to solving MDPs can provide insights. There is a long line of work on regret analysis for MDPs (Azar et al., 2017; Dann et al., 2017; 2019; Zanette & Brunskill, 2019; Zhang et al., 2021a). In this paper, we focus on time-homogeneous, finite-horizon, undiscounted MDPs whose total reward is upper-bounded by 1. Recent work showed that in this setting the regret can be (nearly) horizon-free for tabular MDPs (Wang et al., 2020; Zhang et al., 2021a; 2022; 2020; Ren et al., 2021). Importantly, these results indicate that RL may not be more difficult than bandits in the minimax sense. More recent work generalized the horizon-free results to other MDP problems (Zhang et al., 2021b; Kim et al., 2021; Tarbouriech et al., 2021; Zhou & Gu, 2022). However, all existing work with horizon-free guarantees only considered single-environment problems. Ours is the first horizon-free guarantee that goes beyond MDPs.
Neu & Pike-Burke (2020) summarized the "optimism in the face of uncertainty" (OFU) principle in RL. They named two types of optimism: model-optimistic algorithms construct confidence sets around the empirical transitions and rewards, and select the policy with the highest value under the best possible model in these sets; value-optimistic algorithms construct upper bounds on the optimal value functions, and select the policy which maximizes this optimistic value function. Our paper follows their idea and provides one algorithm for each type of optimism.
Variance-dependent regrets for MDPs. Variance-dependent regrets have been studied in the MDP setting. Talebi & Maillard (2018) provide a regret bound that scales with the variance of the next-step value functions under strong ergodicity assumptions on the MDP. Namely, they define $V_{s,a}$ for each $(s,a)$ pair and derive a regret of $\widetilde{O}\big(\sqrt{S \sum_{s,a} V_{s,a} T}\big)$ in the infinite-horizon setting.
Simchowitz & Jamieson (2019) combine gap-dependent regret with variances. The standard notion $\mathrm{gap}(s,a)$ is the gap between the optimal value function and the optimal Q-function, and $\mathrm{gap}_{\min}$ is the minimum non-zero gap. Letting $\mathrm{Var}^\star_{h,s,a}$ be the variance of the optimal value function at the $(h,s,a)$ triple, their regret approximately scales as
$$\widetilde{O}\left(\sum_{s,a} \frac{H \max_h \mathrm{Var}^\star_{h,s,a}}{\max\{\mathrm{gap}(s,a),\ \mathrm{gap}_{\min}\}} \log(K)\right).$$
Variance-aware bounds also exist for bandits (Kim et al., 2021; Zhang et al., 2021b; Zhou et al., 2021; Zhao et al., 2023; 2022) and linear MDPs (Li & Sun, 2023).
3. Problem Setup
In this section, we give a formal definition of Latent
Markov Decision Processes (Latent MDPs).
Notations. For any event $E$, we use $[E]$ to denote the indicator function, i.e., $[E] = 1$ if $E$ holds and $[E] = 0$ otherwise. For any set $X$, we use $\Delta(X)$ to denote the probability simplex over $X$. For any positive integer $n$, we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. For any probability distribution $P$, we use $\mathrm{supp}(P) = \|P\|_0$ to denote the size of the support of $P$, i.e., $\sum_x [P(x) > 0]$. There are three ways to denote a $d$-dimensional vector (function): suppose $p$ is any parameter, then $x(p) = (x_1(p), x_2(p), \ldots, x_d(p))$ if the indices are natural numbers, while $x(\cdot \mid p) = (x(i_1 \mid p), x(i_2 \mid p), \ldots, x(i_d \mid p))$ and $x_p(\cdot) = (x_p(i_1), x_p(i_2), \ldots, x_p(i_d))$ if the indices are from the set $I = \{i_1, i_2, \ldots, i_d\}$. For any number $q$, we use $x^q$ to denote the vector $(x_1^q, x_2^q, \ldots, x_d^q)$. For two $d$-dimensional vectors $x$ and $y$, we use $x^\top y = \sum_i x_i y_i$ to denote the inner product. If $x$ is a probability distribution, we use $\mathbb{V}(x, y) = \sum_i x_i (y_i - x^\top y)^2 = x^\top (y^2) - (x^\top y)^2$ to denote the empirical variance. We use $\iota = 2 \ln\big(\tfrac{2MSAHK}{\delta}\big)$ as a log term, where $\delta$ is the confidence parameter.
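As a quick illustration of the empirical-variance notation $\mathbb{V}(x, y)$, the following minimal Python sketch (our own, not from the paper) computes it both ways and checks that they agree.

```python
import numpy as np

def empirical_variance(x: np.ndarray, y: np.ndarray) -> float:
    """V(x, y) = sum_i x_i (y_i - x^T y)^2 = x^T (y^2) - (x^T y)^2,
    where x is a probability distribution over the indices of y."""
    mean = x @ y
    return float(x @ (y ** 2) - mean ** 2)

# Example: a fair coin over rewards 0 and 1 has variance 1/4.
x = np.array([0.5, 0.5])
y = np.array([0.0, 1.0])
assert np.isclose(empirical_variance(x, y), np.sum(x * (y - x @ y) ** 2))
print(empirical_variance(x, y))  # 0.25
```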
3.1. Latent Markov Decision Process
A Latent MDP (Kwon et al., 2021) is a collection of finitely many MDPs $\mathcal{M} = \{\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_M\}$, where $M = |\mathcal{M}|$. All the MDPs share the state set $\mathcal{S}$, action set $\mathcal{A}$ and horizon $H$. Each MDP $\mathcal{M}_m = (\mathcal{S}, \mathcal{A}, H, \nu_m, P_m, R_m)$ has its own initial state distribution $\nu_m \in \Delta(\mathcal{S})$, transition model $P_m : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ and deterministic reward function $R_m : \mathcal{S} \times \mathcal{A} \to [0, 1]$. Let $w_1, w_2, \ldots, w_M$ be the mixing weights of the MDPs, such that $w_m > 0$ for any $m$ and $\sum_{m=1}^{M} w_m = 1$.

Denote $S = |\mathcal{S}|$, $A = |\mathcal{A}|$ and $\Gamma = \max_{m,s,a} \mathrm{supp}(P_m(\cdot \mid s, a))$. $\Gamma$ can be interpreted as the maximum degree of each transition, which is a quantity our regret bound depends on. Note that we always have $\Gamma \le S$. In previous work, Lattimore & Hutter (2012) assume $\Gamma = 2$, and Fruit et al. (2020) also have a regret bound that scales with $\Gamma$.
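To make these objects concrete, here is a minimal sketch of an LMDP as $M$ tabular MDPs stored in shared arrays, together with the computation of $\Gamma$ as the maximum support size of any transition distribution. The container and field names are our own choices for illustration, not the paper's.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class LatentMDP:
    nu: np.ndarray  # shape (M, S): initial state distribution nu_m of each MDP
    P: np.ndarray   # shape (M, S, A, S): transition model P_m(s'|s, a)
    R: np.ndarray   # shape (M, S, A): deterministic rewards R_m(s, a) in [0, 1]
    w: np.ndarray   # shape (M,): mixing weights, positive and summing to 1
    H: int          # shared planning horizon

    def gamma(self) -> int:
        """Gamma = max over (m, s, a) of |supp(P_m(.|s, a))|, at most S."""
        return int((self.P > 0).sum(axis=-1).max())
```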
In the worst case, the optimal policy of an LMDP is history-dependent and is PSPACE-hard to find (Corollary 1 and Proposition 3 in Steimle et al. (2021)). Aside from the computational difficulty, storing a history-dependent policy requires exponentially large space, so it is generally impractical. In this paper, we seek to provide a result for any fixed policy class $\Pi$. For example, we can take $\Pi$ to be the set of all history-independent, deterministic policies to alleviate the space issue. Following previous work (Kwon et al., 2021), we assume access to oracles for planning and optimization. See Section 5 for the formal definitions.
We consider an episodic, finite-horizon and undiscounted reinforcement learning problem on LMDPs. In this problem, the agent interacts with the environment for $K$ episodes. At the start of every episode, one MDP $\mathcal{M}_m \in \mathcal{M}$ is randomly chosen with probability $w_m$. Throughout the episode, the true context is hidden, and the agent can only choose actions based on the history information up to the current time. However, at the end of each episode (after $H$ steps), the true context $m$ is revealed to the agent. This permits an unbiased model estimation for the LMDP. As in Cohen et al. (2020), where the central difficulty is estimating the transitions, we also focus on learning $P$ only. For simplicity, we assume that $w$ and $\nu$ are known to the agent, because they can be estimated easily.
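A minimal sketch of this interaction protocol (our own code, reusing the hypothetical `LatentMDP` container sketched earlier and an arbitrary history-dependent policy): the latent index $m$ is drawn once, hidden during the episode, and only revealed after the $H$-th step.

```python
import numpy as np

def run_episode(lmdp, policy, rng):
    """One episode of the protocol: sample m ~ w, roll out H steps of MDP m
    while hiding m from the policy, then reveal m in hindsight."""
    m = rng.choice(len(lmdp.w), p=lmdp.w)            # latent MDP, hidden from the agent
    s = rng.choice(lmdp.nu.shape[1], p=lmdp.nu[m])   # initial state ~ nu_m
    history, total_reward = [], 0.0
    for _ in range(lmdp.H):
        a = policy(history, s)                       # may depend on the full history
        r = lmdp.R[m, s, a]
        s_next = rng.choice(lmdp.P.shape[-1], p=lmdp.P[m, s, a])
        history.append((s, a, r))
        total_reward += r
        s = s_next
    return m, history, total_reward                  # context m revealed here
```

The revealed context $m$ is what a model-based learner would use to update its empirical estimate of $P_m$.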
Conditions on rewards. We assume that $R$ is deterministic and known to the agent. This assumption is for simplicity, and our analysis can be extended to unknown and bounded-support reward distributions. We study LMDPs in which the total reward within an episode is upper-bounded by 1 almost surely for any policy. This condition poses more difficulty for the design of a horizon-free algorithm, because the ordinary case of uniformly bounded rewards can be converted to total-bounded rewards by multiplying the rewards by $1/H$. Under this condition, an algorithm needs to handle a possible spike of reward at a certain step.
3.2. Value functions, Q-functions and alpha vectors
By convention, the expected reward of executing a policy on any MDP can be described by the value function $V$ and the Q-function $Q$. Since for MDPs there is always an optimal policy which is history-independent, $V$ and $Q$ only need the current state and action as parameters.
However, these notations fall short under the LMDP setting. The full information is encoded in the history, so here we use a more general notion called the alpha vector (following the notation in Kwon et al. (2021)). For any time $t \ge 1$, let $h_t = (s, a, r)_{1:t-1} s_t$ be the history up to time $t$. Define $\mathcal{H}_t$ as the set of histories observable at time step $t$, and $\mathcal{H} := \cup_{t=1}^{H} \mathcal{H}_t$ as the set of all possible histories. We define the alpha vectors $\alpha^\pi_m(h)$ for $(m, h) \in [M] \times \mathcal{H}$ as follows:
$$\alpha^\pi_m(h) := \mathbb{E}_{\pi, \mathcal{M}_m}\left[\sum_{t'=t}^{H} R_m(s_{t'}, a_{t'}) \,\middle|\, h_t = h\right],$$
$$\alpha^\pi_m(h, a) := \mathbb{E}_{\pi, \mathcal{M}_m}\left[\sum_{t'=t}^{H} R_m(s_{t'}, a_{t'}) \,\middle|\, (h_t, a_t) = (h, a)\right].$$
The alpha vectors are indeed the value functions and Q-functions on each individual MDP.
Next, we introduce the concept of the belief state to show how to do planning in an LMDP. Let $b_m(h)$ denote the belief state over the $M$ MDPs corresponding to a history $h$, i.e., the probability of the true MDP being $\mathcal{M}_m$ conditioned on observing history $h$. We have the following recursion:
$$b_m(s) = \frac{w_m \nu_m(s)}{\sum_{m'=1}^{M} w_{m'} \nu_{m'}(s)}, \qquad b_m(hars') = \frac{b_m(h)\, P_m(s' \mid s, a)\, [r = R_m(s, a)]}{\sum_{m'=1}^{M} b_{m'}(h)\, P_{m'}(s' \mid s, a)\, [r = R_{m'}(s, a)]}.$$
The value functions and Q-functions for an LMDP are defined via belief states and alpha vectors:
$$V^\pi(h) := b(h)^\top \alpha^\pi(h) \qquad \text{and} \qquad Q^\pi(h, a) := b(h)^\top \alpha^\pi(h, a).$$
Direct computation (see Appendix B.1) gives
$$V^\pi(h) = \sum_{a \in \mathcal{A}} \pi(a \mid h)\, Q^\pi(h, a) = \sum_{a \in \mathcal{A}} \pi(a \mid h) \Bigg( b(h)^\top R(s, a) + \sum_{s' \in \mathcal{S},\, r} \sum_{m'=1}^{M} b_{m'}(h)\, P_{m'}(s' \mid s, a)\, [r = R_{m'}(s, a)]\, V^\pi(hars') \Bigg).$$
So planning in an LMDP can be viewed as planning over belief states. For the optimal history-dependent policy, we can select
$$\pi^\star(h) = \operatorname*{arg\,max}_{a \in \mathcal{A}} \Bigg( b(h)^\top R(s, a) + \sum_{s' \in \mathcal{S},\, r} \sum_{m'=1}^{M} b_{m'}(h)\, P_{m'}(s' \mid s, a)\, [r = R_{m'}(s, a)]\, V^{\pi^\star}(hars') \Bigg), \quad (1)$$
using dynamic programming in descending order of the length of $h$.
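Equation (1) can be turned into a simple (and, in general, exponential-time) recursion, since for planning purposes the history enters only through the belief, the current state and the step index. The sketch below is our own illustration of this recursion, not the planning oracle assumed by the algorithms.

```python
import numpy as np

def optimal_value(lmdp, b, s, t):
    """V*(h) for a history summarized by (belief b over the M MDPs, current
    state s, step t), following the recursion in Equation (1).  The recursion
    enumerates reachable beliefs, so it is exponential in H in general."""
    if t > lmdp.H:
        return 0.0
    A, S = lmdp.R.shape[2], lmdp.P.shape[-1]
    best = -np.inf
    for a in range(A):
        q = float(b @ lmdp.R[:, s, a])                  # b(h)^T R(s, a)
        for r in np.unique(lmdp.R[:, s, a]):            # branch on observed reward
            mask = (lmdp.R[:, s, a] == r)
            for s_next in range(S):                     # branch on next state
                weight = b * mask * lmdp.P[:, s, a, s_next]
                p = weight.sum()
                if p > 1e-12:
                    q += p * optimal_value(lmdp, weight / p, s_next, t + 1)
        best = max(best, q)
    return best

# Usage: start from the initial belief conditioned on the first state s_1,
# e.g. optimal_value(lmdp, initial_belief(lmdp.w, lmdp.nu, s1), s1, 1).
```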
3.3. Performance measure
We use the cumulative regret to measure the algorithm's performance. The optimal policy is $\pi^\star = \arg\max_{\pi \in \Pi} V^\pi$, which also does not know the context when interacting with the LMDP. Suppose the agent interacts with the environment for $K$ episodes, and for each episode $k$ a policy $\pi_k$ is played. The regret is defined as
$$\mathrm{Regret}(K) := \sum_{k=1}^{K} \left(V^\star - V^{\pi_k}\right).$$
Since the planning problem for LMDPs is time-consuming, we may assume access to an efficient planning oracle (e.g., a greedy algorithm or an approximation algorithm) with the following performance guarantee (Kwon et al., 2021): given $\mathcal{M}$, the policy $\pi$ it returns satisfies $V^\pi_{\mathcal{M}} \ge \rho_1 V^\star_{\mathcal{M}} - \rho_2$. Then we define the regret as
$$\widetilde{\mathrm{Regret}}(K) := \sum_{k=1}^{K} \left(\rho_1 V^\star - \rho_2 - V^{\pi_k}\right).$$
Notice that $\widetilde{\mathrm{Regret}}(K) \le \rho_1 \mathrm{Regret}(K)$, so we essentially study the upper bound of $\mathrm{Regret}(K)$.
4. Variance for LMDPs
We introduce the maximum policy-value variance, which is
novel in the literature.
Definition 1. For any policy $\pi \in \Pi$, its maximum value variance is defined as
$$\mathrm{Var}^\pi := \mathbb{V}\big(w \circ \nu,\ \alpha^\pi_{\cdot}(\cdot)\big) + \mathbb{E}_\pi\left[\sum_{t=1}^{H} \mathbb{V}\big(P_m(\cdot \mid s_t, a_t),\ \alpha^\pi_m(h_t a_t r_t\, \cdot)\big)\right].$$
The maximum policy-value variance for a particular LMDP is defined as
$$\mathrm{Var} := \max_{\pi \in \Pi} \mathrm{Var}^\pi.$$
$\mathrm{Var}^\pi$ is the variance of the total reward of $\pi$; the justification can be found in Appendix B.2.
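Since Appendix B.2 identifies $\mathrm{Var}^\pi$ with the variance of the total episode reward of $\pi$, it can be estimated by Monte Carlo; a minimal sketch (our own code, reusing `run_episode` from the earlier protocol sketch):

```python
import numpy as np

def estimate_var_pi(lmdp, policy, n_episodes=100_000, seed=0):
    """Monte-Carlo estimate of Var^pi: the variance of the total episode reward
    of policy pi on the LMDP (the latent MDP is re-drawn every episode).
    Var is then the maximum of this quantity over the deterministic policies in Pi."""
    rng = np.random.default_rng(seed)
    returns = [run_episode(lmdp, policy, rng)[2] for _ in range(n_episodes)]
    return float(np.var(returns))
```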
5. Main Algorithms and Results
In this section, we present two algorithms and show their minimax regret guarantee. The first uses a Bernstein confidence set on the transition probabilities, which was first applied to SSP in Cohen et al. (2020) to derive a horizon-free regret. This algorithm uses a bi-level optimization oracle: for the inner layer, an oracle is needed to find the optimal policy inside $\Pi$ under a given LMDP; for the outer layer, an oracle finds the best transition model inside the confidence set, i.e., the one maximizing the optimal expected reward.
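To illustrate, a Bernstein-type confidence set constrains each transition entry around its empirical estimate with a radius driven by the empirical variance of that Bernoulli entry. The sketch below is our own illustration with placeholder constants; it is not the exact radius used in Algorithm 2.

```python
import numpy as np

def bernstein_radius(p_hat: np.ndarray, n: int, iota: float) -> np.ndarray:
    """Elementwise Bernstein-style confidence radius for the empirical
    transition probabilities p_hat (a distribution over next states) estimated
    from n samples; iota is the log term.  Constants are illustrative only."""
    n = max(n, 1)
    return np.sqrt(2.0 * p_hat * (1.0 - p_hat) * iota / n) + iota / n

# The confidence set for P_m(.|s, a) would then contain all distributions p with
# |p - p_hat| <= bernstein_radius(p_hat, n, iota) elementwise.
```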
The second adapts the Monotonic Value Propagation (MVP) algorithm (Zhang et al., 2021a) to LMDPs. This algorithm requires an oracle to solve an LMDP with dynamic bonuses: the bonus depends on the variance of the next-step alpha vector. Both algorithms enjoy the following regret guarantee.
Theorem 2. For both the Bernstein confidence set for LMDPs (Algorithm 1 combined with Algorithm 2) and Monotonic Value Propagation for LMDPs (Algorithm 1 combined with Algorithm 3), with probability at least $1 - \delta$, we have
$$\mathrm{Regret}(K) \le \widetilde{O}\left(\sqrt{\mathrm{Var} \cdot M \Gamma S A K} + M S^2 A\right).$$
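For the value-optimistic instantiation, the bonus at a state-action pair is driven by the empirical variance of the next-step alpha vector plus a lower-order $1/n$ term, in the spirit of Bennett's inequality. The sketch below is our own illustration with placeholder constants (it reuses `empirical_variance` from the notation sketch); it is not the exact bonus of Algorithm 3.

```python
import numpy as np

def bennett_bonus(p_hat: np.ndarray, alpha_next: np.ndarray,
                  n: int, iota: float, c1: float = 2.0, c2: float = 14.0) -> float:
    """Bennett-style exploration bonus for one (m, s, a) pair: scales with
    sqrt(V(p_hat, alpha_next) * iota / n) plus an iota / n correction.
    c1 and c2 are placeholder constants, not the paper's."""
    n = max(n, 1)
    var = empirical_variance(p_hat, alpha_next)   # V(p_hat, alpha_next)
    return c1 * np.sqrt(var * iota / n) + c2 * iota / n
```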