
Rieder 2011). However, the problem of learning the optimal policy of a finite-horizon unknown CTMDP has not, to the best of our knowledge, been studied in the literature. The goal of this paper is to fill this gap by developing a learning algorithm for episodic RL in tabular CTMDPs with rigorous worst-case regret guarantees, as well as by providing a worst-case regret lower bound. Here, the regret measures the difference between the reward collected by the algorithm during learning and the reward that an optimal policy would collect had the model parameters been completely known.
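As one concrete formalization (the precise definition and notation used later in the paper may differ), the expected episodic regret over K episodes can be written as
\[
  \mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\Bigl( V^{\star}\bigl(x_0^{k}\bigr) - V^{\pi_k}\bigl(x_0^{k}\bigr) \Bigr),
\]
where $V^{\star}$ denotes the value of an optimal policy under the true (unknown) model, $\pi_k$ is the policy executed in episode $k$, and $x_0^{k}$ is the initial state of episode $k$.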
At the outset, one may think that learning CTMDPs is just a straightforward extension of learning their discrete-time counterparts in terms of theoretical analysis and algorithm design, and thus not worthy of a separate study. This is not the case. The reason is that a CTMDP is characteristically different from a DTMDP in that, after an agent observes a state x and takes an action a, the process remains in that state for a random holding time that follows an exponential distribution with rate parameter λ(x, a). Then the process jumps to another state y with transition probability p(y|x, a). For DTMDPs, the holding times (i.e., inter-transition times) are deterministic and take the value of one
time step in the general framework of semi-Markov decision processes (Puterman 2014, Chapter
11). This difference, along with the continuity of the time parameter, leads to several subtle
yet substantial challenges in the design and analysis of episodic RL algorithms for CTMDPs. In
particular, existing derivations of state-of-the-art worst-case regret bounds for learning tabular
DTMDPs (see, e.g., Azar et al. 2017, Zanette and Brunskill 2019) often rely on induction/recursion
in the time parameter, which naturally fails for continuous-time problems.
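To make this difference concrete, the following minimal Python sketch simulates individual transitions of a tabular CTMDP as just described. The arrays `rate` and `trans_prob`, the horizon `T`, and all other names are illustrative placeholders, not objects defined in the paper.

```python
import numpy as np

# Illustrative stand-ins for the (unknown) model parameters:
# rate[x, a] plays the role of lambda(x, a); trans_prob[x, a] of p(. | x, a).
rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
rate = rng.uniform(0.5, 2.0, size=(n_states, n_actions))
trans_prob = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def ctmdp_step(x, a):
    """Sample one transition: an exponential holding time, then a jump."""
    holding_time = rng.exponential(1.0 / rate[x, a])   # Exp(lambda(x, a)) sojourn
    y = rng.choice(n_states, p=trans_prob[x, a])       # jump to y with p(y | x, a)
    return holding_time, y

# Example: simulate a trajectory up to a finite horizon T under a fixed action.
T, t, x = 5.0, 0.0, 0
while t < T:
    dt, x_next = ctmdp_step(x, a=0)
    t, x = t + dt, x_next
```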
In this paper, we develop a learning algorithm for CTMDPs with unknown transition proba-
bilities and holding time rates, based on two key methodological ingredients. First, we build on
the value iteration algorithms of Mamer (1986), Huang and Guo (2011) for solving finite-horizon
CTMDPs when all the model parameters are known. This type of algorithm is based on an
operator with a fixed point corresponding to the solution to the dynamic programming equation.
A major benefit of this approach is that it naturally introduces a (discrete) number of iterations
that can be used for induction arguments in regret analysis. Second, we take the idea of “optimism
under uncertainty”, i.e., one acts as if the environment is as nice as plausibly possible (Lattimore
and Szepesvári 2020). This is a popular paradigm for addressing the exploration–exploitation trade-off in RL. To ensure optimism, that is, to obtain a reasonably good upper bound on the true value function, we carefully design the exploration bonus needed for efficient learning and regret
analysis.
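As a rough illustration of the fixed-point idea, the sketch below iterates a Bellman-type operator for a discounted, infinite-horizon tabular CTMDP with reward rates, obtained by integrating out the exponential holding time. This is a simplified stand-in chosen for brevity, not the finite-horizon operator of Mamer (1986) and Huang and Guo (2011); all function and variable names are ours.

```python
import numpy as np

def bellman_operator(V, reward, rate, trans_prob, alpha):
    """One application of a discounted CTMDP Bellman-type operator.

    For each state x this computes
        max_a (r(x, a) + lambda(x, a) * sum_y p(y | x, a) V(y)) / (alpha + lambda(x, a)),
    which follows by integrating the exponential holding time out of the
    dynamic programming equation; its fixed point is the value function.
    """
    expected_next = trans_prob @ V                       # (S, A): sum_y p(y | x, a) V(y)
    q_values = (reward + rate * expected_next) / (alpha + rate)
    return q_values.max(axis=1)

def value_iteration(reward, rate, trans_prob, alpha, tol=1e-8, max_iter=10_000):
    """Iterate the operator until the sup-norm change falls below `tol`.

    The iteration counter n is the discrete index that can serve as the basis
    for induction arguments, as discussed in the text.
    """
    V = np.zeros(reward.shape[0])
    for n in range(max_iter):
        V_new = bellman_operator(V, reward, rate, trans_prob, alpha)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, n + 1
        V = V_new
    return V, max_iter
```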
Because we are dealing with Markov chains (instead of, say, diffusion processes with continuous
state space), it is only natural and indeed unavoidable that certain components of our analysis
are inspired by those for discrete-time Markov chains, in particular the UCRL2 (Upper Confidence
Reinforcement Learning) algorithm of Jaksch et al. (2010) and the UCBVI (Upper Confidence
Bound Value Iteration) algorithm of Azar et al. (2017), both originally developed for RL in tabular
DTMDPs. At the beginning of each episode, our algorithm estimates the transition probabilities
and the holding time rate using the past data/samples. Based on the level of uncertainty in these
estimates, we carefully design an exploration bonus for each state–action pair. We then update
the policy by applying a (modified) value iteration to the estimated CTMDP with the reward
augmented by the bonus.
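The per-episode structure just described might be sketched as follows. The maximum-likelihood estimates of the transition probabilities and holding time rates are standard, but the Hoeffding-style bonus and the discounted planner (the `value_iteration` sketch above) are generic placeholders rather than the bonus and the modified value iteration actually used in the paper; all names are illustrative.

```python
import numpy as np

def estimate_and_plan(counts, total_holding_time, cum_reward, trans_counts,
                      alpha=0.1, bonus_const=1.0, delta=0.05):
    """Schematic per-episode update: estimate the model, add a bonus, re-plan.

    counts[x, a]             -- number of observed transitions from (x, a)
    total_holding_time[x, a] -- total sojourn time accumulated at (x, a)
    cum_reward[x, a]         -- cumulative reward observed at (x, a)
    trans_counts[x, a, y]    -- number of observed jumps (x, a) -> y
    """
    n = np.maximum(counts, 1)                               # avoid division by zero
    rate_hat = n / np.maximum(total_holding_time, 1e-12)    # estimate of lambda(x, a)
    p_hat = trans_counts / n[:, :, None]                    # empirical p(y | x, a)
    reward_hat = cum_reward / n

    # Generic Hoeffding-style exploration bonus shrinking like 1/sqrt(N(x, a));
    # the bonus designed in the paper for CTMDPs is not reproduced here.
    bonus = bonus_const * np.sqrt(np.log(2.0 / delta) / n)

    # Re-plan on the optimistic model: estimated CTMDP with bonus-augmented reward.
    V_opt, _ = value_iteration(reward_hat + bonus, rate_hat, p_hat, alpha)
    greedy = np.argmax(
        (reward_hat + bonus + rate_hat * (p_hat @ V_opt)) / (alpha + rate_hat),
        axis=1,
    )
    return greedy, V_opt
```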
Our main result, Theorem 1, indicates that the proposed RL algorithm has a worst-case expected regret of Õ(√K), where K is the number of episodes and Õ ignores logarithmic factors and hides