Square-root regret bounds for continuous-time episodic Markov
decision processes
Xuefeng Gao 1, Xun Yu Zhou 2
October 4, 2023
Abstract
We study reinforcement learning for continuous-time Markov decision processes (MDPs) in
the finite-horizon episodic setting. In contrast to discrete-time MDPs, the inter-transition times
of a continuous-time MDP are exponentially distributed with rate parameters depending on the
state–action pair at each transition. We present a learning algorithm based on the methods
of value iteration and upper confidence bound. We derive an upper bound on the worst-case
expected regret for the proposed algorithm, and establish a worst-case lower bound, with both bounds being of square-root order in the number of episodes. Finally, we conduct simulation
experiments to illustrate the performance of our algorithm.
1 Introduction
Reinforcement learning (RL) studies the problem of sequential decision making in an unknown
environment by carefully balancing between exploration (learning) and exploitation (optimizing)
(Sutton and Barto 2018). While the study of RL has a relatively long history, it has received considerable attention in the past decades due to the explosion of available data and the rapid improvement of computing power. A hitherto default mathematical framework for RL is the Markov decision process
(MDP), where the agent does not know the transition probabilities and can observe a reward result-
ing from an action but does not know the reward function itself. There has been extensive research
on RL for discrete-time MDPs (DTMDPs); see, e.g., Jaksch et al. (2010), Osband and Van Roy
(2017), Azar et al. (2017), Jin et al. (2018). However, much less attention has been paid to RL for
continuous-time MDPs, even though there are many real-world applications where one needs to interact
with the unknown environment and learn the optimal strategies continuously in time. Examples
include autonomous driving, control of queueing systems, control of infectious diseases, preventive
maintenance and robot navigation; see, e.g., Guo and Hernández-Lerma (2009), Piunovskiy and
Zhang (2020), Chapter 11 of Puterman (2014) and the references therein.
In this paper we study RL for tabular continuous-time Markov decision processes (CTMDPs)
in the finite-horizon, episodic setting, where an agent interacts with the unknown environment in
episodes of a fixed length with finite state and action spaces. The study of model-based (i.e. the
underlying models are assumed to be known) finite-horizon CTMDPs has a very long history, proba-
bly dating back to Miller (1968), with vast applications including queueing optimization (Lippman
1976), dynamic pricing (Gallego and Van Ryzin 1994), and finance and insurance (Bäuerle and
1Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong
Kong, China. Email: xfgao@se.cuhk.edu.hk.
2Department of Industrial Engineering and Operations Research and The Data Science Institute, Columbia Uni-
versity, New York, NY 10027, USA. Email: xz2574@columbia.edu.
Rieder 2011). However, the problem of learning the optimal policy of a finite-horizon unknown
CTMDP has not been studied in the literature, to the best of our knowledge. The goal of this paper is to
fill this gap by developing a learning algorithm for episodic RL in tabular CTMDPs with rigorous
worst-case regret guarantees, as well as providing a worst-case regret lower bound. Here, the regret
measures the difference between the reward collected by the algorithm during learning and the
reward of an optimal policy should the model parameters be completely known.
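Informally, if V^π(0, x_0) denotes the expected total reward over [0, H] collected by a policy π starting from the initial state x_0, V*(0, x_0) = max_π V^π(0, x_0), and π_k is the policy employed by the algorithm in episode k, then the expected regret over K episodes can be read as

Regret(K) := E[ Σ_{k=1}^{K} ( V*(0, x_0) − V^{π_k}(0, x_0) ) ];

this is only an informal reading, and the precise definition is given in the body of the paper.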
At the outset, one may think that learning CTMDPs is just a straightforward extension of learning their discrete-time counterparts in terms of theoretical analysis and algorithm design, and thus not worthy of a
separate study. This is not the case. The reason is that a CTMDP is characteristically different from a DTMDP in that, after an agent observes a state x and takes an action a, the process remains in this state for a random holding time period that follows an exponential distribution with rate parameter λ(x, a). Then the process jumps to another state y with a transition probability p(y|x, a). For DTMDPs, the holding times (i.e., inter-transition times) are deterministic and take the value of one time step in the general framework of semi-Markov decision processes (Puterman 2014, Chapter
11). This difference, along with the continuity of the time parameter, leads to several subtle
yet substantial challenges in the design and analysis of episodic RL algorithms for CTMDPs. In
particular, existing derivations of state-of-the-art worst-case regret bounds for learning tabular
DTMDPs (see, e.g., Azar et al. 2017, Zanette and Brunskill 2019) often rely on induction/recursion
in the time parameter, which naturally fails for continuous-time problems.
In this paper, we develop a learning algorithm for CTMDPs with unknown transition proba-
bilities and holding time rates, based on two key methodological ingredients. First, we build on
the value iteration algorithms of Mamer (1986), Huang and Guo (2011) for solving finite-horizon
CTMDPs when all the model parameters are known. This type of algorithm is based on an operator whose fixed point corresponds to the solution of the dynamic programming equation.
A major benefit of this approach is that it naturally introduces a (discrete) number of iterations
that can be used for induction arguments in regret analysis. Second, we take the idea of “optimism
under uncertainty”, i.e., one acts as if the environment is as nice as plausibly possible (Lattimore
and Szepesvári 2020). This is a popular paradigm to address the exploration–exploitation trade-off
in RL. To ensure optimism for the purpose of getting a reasonably good upper bound of the true
value function, we carefully design the exploration bonus needed for efficient learning and regret
analysis.
Because we are dealing with Markov chains (instead of, say, diffusion processes with continuous
state space), it is only natural and indeed unavoidable that certain components of our analysis
are inspired by those for discrete-time Markov chains, in particular the UCRL2 (Upper Confidence
Reinforcement Learning) algorithm of Jaksch et al. (2010) and the UCBVI (Upper Confidence
Bound Value Iteration) algorithm of Azar et al. (2017), both originally developed for RL in tabular
DTMDPs. At the beginning of each episode, our algorithm estimates the transition probabilities
and the holding time rate using the past data/samples. Based on the level of uncertainty in these
estimates, we carefully design an exploration bonus for each state–action pair. We then update
the policy by applying a (modified) value iteration to the estimated CTMDP with the reward
augmented by the bonus.
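For intuition only, the following Python sketch outlines this per-episode structure. It is a simplified skeleton rather than the paper's actual algorithm: the bonus formula is a generic placeholder, and the helpers value_iteration_ctmdp and run_episode, as well as env.reward_rate, are hypothetical interfaces introduced purely for illustration.

```python
import numpy as np

def optimistic_learning_loop(env, num_states, num_actions, H, K,
                             value_iteration_ctmdp, run_episode, delta=0.05):
    """Illustrative skeleton of an optimistic episodic learner for a tabular CTMDP.

    The planner (value_iteration_ctmdp) and the simulator (run_episode) are passed in
    as callables because their exact forms are not reproduced in this sketch.
    """
    visits = np.zeros((num_states, num_actions))              # visits with an observed next state
    trans = np.zeros((num_states, num_actions, num_states))   # observed transitions (x, a) -> y
    time_in = np.zeros((num_states, num_actions))             # total holding time accumulated at (x, a)

    for k in range(1, K + 1):
        # Empirical estimates of p(.|x, a) and lambda(x, a) from the data so far.
        p_hat = trans / np.maximum(visits[:, :, None], 1.0)
        lam_hat = visits / np.maximum(time_in, 1e-8)

        # Placeholder exploration bonus: one term shrinking with visit counts, one with
        # accumulated holding time (mimicking, but not equal to, the paper's bonus).
        log_term = np.log(num_states * num_actions * k / delta)
        bonus = (np.sqrt(log_term / np.maximum(visits, 1.0))
                 + np.sqrt(log_term / np.maximum(time_in, 1e-8)))

        # Optimistic planning: (modified) value iteration on the estimated CTMDP with
        # the reward rate augmented by the bonus.
        policy = value_iteration_ctmdp(p_hat, lam_hat, env.reward_rate + bonus, H)

        # Execute the policy for one episode of length H and record the data.
        for (x, a, tau, y, observed) in run_episode(env, policy, H):
            time_in[x, a] += tau
            if observed:               # holding time not truncated by the end of the horizon
                visits[x, a] += 1
                trans[x, a, y] += 1
    return visits, trans, time_in
```

The point of the skeleton is the ordering of steps: estimate, add a bonus, plan optimistically, act, and record both the holding times and, when observed, the next states.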
Our main result, Theorem 1, indicates that the proposed RL algorithm has a worst-case expected regret of Õ(√K), where K is the number of episodes and Õ ignores logarithmic factors and hides
dependency on other constants. Moreover, Theorem 2 provides a lower bound showing that the square-root dependence of the worst-case regret upper bound on K is optimal (up to a logarithmic
factor). Furthermore, we conduct simulation experiments to illustrate the performance of our
algorithm.
While our algorithm bears a certain similarity to UCBVI of Azar et al. (2017) conceptually, there are significant differences in our algorithm design and regret analysis compared with the discrete-time setting, which we elaborate on below.
(a) First, in contrast with learning DTMDPs, we need to estimate the holding time rate λ(x, a)
of a CTMDP for each state–action pair (x, a) in our RL algorithm. In the regret analysis, we
also need to analyze the estimation error of λ(x, a) by establishing a finite sample confidence
bound. Note that in the episodic setting, the random exponential holding time associated
with each state–action pair may be truncated by the endpoint of the time horizon. This
generates possibly non-i.i.d. observations of holding times within the same episode. Hence, we cannot directly apply concentration inequalities for i.i.d. exponential distributions to
obtain the confidence bound. The key idea to overcome this difficulty is to ‘piece together’
the samples of exponential holding times at a pair (x, a) as realizations of inter-arrival times
of a Poisson process with rate λ(x, a) modelling the number of visits to the pair over time in
an episode. Even if the time is eventually truncated, the number of visits to the pair (x, a)
(more precisely, those visits where the next states are observed) in an episode still follows
a Poisson distribution conditional on the total amount of time spent at (x, a). Since the holding times at (x, a) across different episodes are independent, we can then apply tail bounds for the Poisson distribution to derive the desired confidence bound for the holding time rate λ(x, a) (see the illustrative sketch after this list). Such an analysis is not required for learning DTMDPs.
(b) Second, because the value function of a CTMDP depends on both the transition probabilities
and the holding time rates, the design of the exploration bonus in our algorithm, which is
used to bound estimation errors on the value function, is more sophisticated compared with
the discrete-time setting. Specifically, we need to carefully analyze how the estimation errors of holding time rates and transition probabilities jointly affect the computed value function in our learning algorithm. Moreover, because the “Bellman operator” in our continuous-time setting (see the operator T_a in Equation (6)) is more complicated due to the presence of exponential holding times, the error analysis of the learned model also becomes more involved.
(c) Third, our algorithm also differs from the classical UCBVI algorithm in terms of the planning
oracle. In contrast to solving finite-horizon DTMDPs, the value iteration algorithm for solving finite-horizon CTMDPs with known parameters generally requires an infinite number of iterations to converge. This leads to important differences in the regret analysis, in particular in the
proof of optimism (see Lemma 5). In the study of learning DTMDPs, the optimism can be
proved by backward induction on the time parameter. In our setting of episodic CTMDPs,
however, induction is not enough to establish optimism. Extra analysis of (modified) value
iterations for CTMDPs (Algorithm 2) is required, because Algorithm 2 is stopped after a
finite number of iterations and we need to quantify the error. In this paper, we show that
for finite-horizon CTMDPs with bounded transition rates, the (modified) value iteration converges asymptotically at a linear rate ρ ∈ (0, 1) as the number of iterations approaches
infinity. This is achieved by showing that the Bellman operator for the empirical CTMDP with
reward function augmented by the bonus (see the operator in Equation (15)) is a contraction
mapping with the coefficient ρ. Such a technical analysis is not required for the classical
UCBVI algorithm in the discrete-time setting.
(d) Fourth, in contrast to the analysis of UCBVI for DTMDPs, the standard pigeon-hole argument
(see, e.g., Azar et al. (2017) or Lemma 7.5 in Chapter 7 of Agarwal et al. 2021), which is used
to bound the sum of bonuses evaluated on data to obtain sublinear regret, cannot be applied to our continuous-time setting. Our exploration bonus involves both the estimation errors of
transition probabilities and holding time rates. The estimation error of transition probabilities
of a CTMDP is not inversely proportional to the square root of the total visit counts at each
state–action pair. This is because after visiting a state–action pair, the random holding time
may be truncated by the endpoint of the horizon, and hence the next transition/state may
not be observed. This subtle difference, compared with the discrete-time setting, leads to
substantial challenges in our regret analysis. In particular, it is possible that the sum of
estimation errors of transition probabilities evaluated on data is linear in K for some sample paths (see (41)). This shows that the standard pathwise pigeon-hole argument cannot be
directly used for obtaining sublinear regret in learning CTMDPs. In our paper, we develop
a new probabilistic method to overcome such a difficulty; see Proposition 1 and its proof.
On the other hand, the estimation error in the holding time rate of a CTMDP is inversely
proportional to the square root of the total time spent at each state after an action, and this total time spent is real-valued instead of integer-valued. Hence, the pigeon-hole principle, as a type of counting argument, does not apply. We overcome this difficulty by relating the total time spent to the visit counts for each state–action pair (see Proposition 2 and its
proof).
(e) Finally, the transitions of a finite-horizon CTMDP do not happen according to a Poisson
process in general, because the rates of exponential holding times λ(·,·) depend on the state–
action pairs. As a consequence, while the horizon length of each episode is fixed, the numbers of transitions in different episodes are random variables that are generally unbounded and
statistically different from each other. This is in sharp contrast with regret minimization
in episodic DTMDPs, where the number of decision steps is often fixed and common across
different episodes. In our regret analysis, we overcome this difficulty by ‘uniformizing’ the
total number of decision steps in each episode for CTMDPs with bounded transition rates,
and then applying bounds for the maximum of K independent Poisson distributions with the
same rate.
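To make the estimator behind item (a) concrete, here is a hedged numerical illustration (our own, not the paper's exact confidence bound): λ(x, a) is estimated by the number of observed, i.e. untruncated, transitions divided by the total time spent at (x, a), and a generic Bernstein-type Poisson tail bound supplies a confidence width given that total time.

```python
import numpy as np

def estimate_rate_with_poisson_ci(holding_times, observed_flags, delta=0.05):
    """Hedged illustration of the holding-time-rate estimate at one (x, a) pair.

    holding_times : time spent at (x, a) on each visit (the last one may be truncated at H)
    observed_flags: True when the next transition was observed (holding time not truncated)

    Conditional on the total time t spent at (x, a), the number n of observed transitions
    follows a Poisson(lambda * t) distribution (as argued in item (a)), so a Bernstein-type
    Poisson tail bound yields a width of roughly sqrt(lambda * log(1/delta) / t)
    + log(1/delta) / t, with constants omitted.
    """
    t = float(np.sum(holding_times))       # total time accumulated at (x, a)
    n = int(np.sum(observed_flags))        # transitions actually observed
    lam_hat = n / t if t > 0 else 0.0      # point estimate of lambda(x, a)

    log_term = np.log(2.0 / delta)
    width = np.sqrt(2.0 * max(lam_hat, 1e-12) * log_term / max(t, 1e-12)) \
            + log_term / max(t, 1e-12)     # illustrative width, not the paper's bonus
    return lam_hat, width

# Toy usage: 50 visits with rate-2 exponential holding times; the last one is truncated.
rng = np.random.default_rng(0)
times = rng.exponential(scale=0.5, size=50)
flags = np.ones(50, dtype=bool)
times[-1], flags[-1] = 0.1, False          # cut short by the episode horizon
print(estimate_rate_with_poisson_ci(times, flags))
```

Note that the truncated visit still contributes to the denominator (total time) but not to the transition count, which is exactly the asymmetry discussed in items (a) and (d).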
This paper is related to several studies on RL for CTMDPs in the infinite-horizon setting.
Computational RL methods were developed for infinite-horizon CTMDPs as well as for the more
general semi-Markov decision processes (SMDPs) very early on (Bradtke and Duff 1995, Das et al.
1999). However, available theoretical results on regret bounds for learning infinite-horizon CTMDPs
are few and far between. Fruit and Lazaric (2017) study learning in continuous-time infinite-horizon average-reward SMDPs, which are more general than CTMDPs in that the holding times can follow general distributions. They adapt the UCRL2 algorithm by Jaksch et al. (2010) and show that their algorithm achieves Õ(√n) worst-case regret after n decision steps. Instead of worst-
case (instance-independent) performance bounds, Gao and Zhou (2022) recently establish instance-
dependent regret bounds that are logarithmic in time for learning tabular CTMDPs, again in the
infinite-horizon average-reward setting. It is well recognized that the finite-horizon continuous-time
decision problem is different from and indeed generally harder than its infinite-horizon counterpart
in a number of aspects, including the need to consider the time variable in the value function
and the truncation of the holding time discussed earlier. As such, our paper naturally differs from
those on infinite-horizon problems.
We conclude this introduction by mentioning a growing body of literature on RL in continuous
time with possibly continuous state and/or action spaces. Recently, Wang et al. (2020), Jia and
Zhou (2022a,b, 2023) consider RL for diffusion processes and study respectively the problems of
generating trial-and-error policies strategically, policy evaluation, policy gradient and q-learning. However, the regret analysis remains largely untouched in that series of studies. Basei et al. (2021),
Guo et al. (2021) and Szpruch et al. (2021) study continuous-time RL for linear quadratic/convex
models with continuous state spaces and propose algorithms with sublinear regret bounds in the
finite-horizon episodic setting. By contrast, our work considers episodic RL for CTMDPs with a
discrete state space.
The remainder of the paper is organized as follows. In Section 2, we formulate the problem of
episodic RL for CTMDPs. In Section 3 we introduce our learning algorithm, while in Section 4 we
present the main results on regret bounds. Section 5 reports the results of simulation experiments and Section 6 contains the proofs of the main results. Finally, Section 7 concludes.
2 Episodic RL in a Tabular CTMDP
We consider a continuous-time Markov decision process (CTMDP) with a finite state space S, a
finite action space A and a finite time horizon [0, H]. The CTMDP evolves as follows (Puterman 2014, Chapter 11). At the initial time 0, the process is in state x_0 ∈ S and an agent chooses an action a_0 ∈ A. The process remains in state x_0 for a random holding time period τ_0 that follows an exponential distribution with rate parameter λ(x_0, a_0), while rewards are continuously accrued at a rate r(x_0, a_0) during the holding period τ_0. Then the process jumps to another state x_1 ∈ S at time τ_0 with a transition probability p(x_1|x_0, a_0), and another action a_1 is taken upon landing on x_1. This series of events is repeated until the end of the horizon, H. It is immediate to see that if action a is chosen in state x, then the joint probability that the holding time in state x is not greater than t and the next state is y is given by (1 − e^{−λ(x, a) t}) · p(y|x, a).
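To make these dynamics concrete, here is a minimal Python simulation sketch of one episode under a fixed stationary policy; the array layout and the policy interface are our own illustrative choices, not part of the model definition.

```python
import numpy as np

def simulate_ctmdp_episode(lam, p, r, policy, x0, H, rng):
    """Simulate one horizon-[0, H] episode of a tabular CTMDP.

    lam[x, a]  : rate of the exponential holding time at (x, a)
    p[x, a, y] : probability of jumping to y when the holding time at (x, a) expires
    r[x, a]    : reward rate accrued while the process holds at (x, a)
    policy[x]  : action chosen upon landing in state x (stationary, for illustration)
    """
    t, x, total_reward = 0.0, x0, 0.0
    while t < H:
        a = policy[x]
        tau = rng.exponential(1.0 / lam[x, a])   # exponential holding time with rate lam(x, a)
        if t + tau >= H:
            total_reward += r[x, a] * (H - t)    # holding time truncated at the horizon;
            break                                # the next transition is never observed
        total_reward += r[x, a] * tau
        t += tau
        x = rng.choice(p.shape[2], p=p[x, a])    # next state drawn from p(.|x, a)
    return total_reward

# Toy usage: two states, two actions, uniform transitions.
rng = np.random.default_rng(1)
lam = np.array([[1.0, 2.0], [0.5, 1.5]])
p = np.full((2, 2, 2), 0.5)
r = np.array([[1.0, 0.2], [0.0, 0.8]])
print(simulate_ctmdp_episode(lam, p, r, policy=np.array([0, 1]), x0=0, H=3.0, rng=rng))
```

The branch where t + τ ≥ H is precisely the truncation discussed in the introduction: the episode ends while the process is still holding at (x, a), so the next state is not observed.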
For (x, a) ∈ S × A, define the (controlled) transition rates
q(y|x, a) := λ(x, a) · p(y|x, a) ≥ 0,   for y ∈ S, y ≠ x,
and set q(x|x, a) = −λ(x, a) ≤ 0, so that Σ_{z∈S} q(z|x, a) = 0. Mathematically, a finite-horizon CTMDP model M is characterized by the following set:
M := (S, A, r(·,·), q(·|·,·), H, x_0),   (1)
where r(·,·) is the reward function, q(·|·,·) is the transition rate, H < ∞ is the length of the horizon and x_0 is the initial state of the CTMDP model. Note that q(·|·,·) and (λ(·,·), p(·|·,·)) are