
Rieder 2011). However, the problem of learning the optimal policy of a finite-horizon unknown CTMDP has not, to the best of our knowledge, been studied in the literature. The goal of this paper is to fill this gap by developing a learning algorithm for episodic RL in tabular CTMDPs with rigorous worst-case regret guarantees, as well as by providing a worst-case regret lower bound. Here, the regret measures the difference between the reward collected by the algorithm during learning and the reward that an optimal policy would collect had the model parameters been completely known.
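As one concrete formalization (the precise definition and notation used later in the paper may differ), the expected episodic regret over K episodes can be written as
\[
  \mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\Bigl( V^{\star}\bigl(x_0^{k}\bigr) - V^{\pi_k}\bigl(x_0^{k}\bigr) \Bigr),
\]
where $V^{\star}$ denotes the value of an optimal policy under the true (unknown) model, $\pi_k$ is the policy executed in episode $k$, and $x_0^{k}$ is the initial state of episode $k$.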
At the outset, one may think that learning CTMDPs is just a straightforward extension of learning their discrete-time counterparts in terms of theoretical analysis and algorithm design, and thus not worthy of a separate study. This is not the case. The reason is that a CTMDP is characteristically different from a DTMDP in that, after an agent observes a state x and takes an action a, the process remains in that state for a random holding time that follows an exponential distribution with rate parameter λ(x, a). Then the process jumps to another state y with transition probability p(y|x, a). For DTMDPs, the holding times (i.e., inter-transition times) are deterministic and take the value of one
time step in the general framework of semi-Markov decision processes (Puterman 2014, Chapter
11). This difference, along with the continuity of the time parameter, leads to several subtle
yet substantial challenges in the design and analysis of episodic RL algorithms for CTMDPs. In
particular, existing derivations of state-of-the-art worst-case regret bounds for learning tabular
DTMDPs (see, e.g., Azar et al. 2017, Zanette and Brunskill 2019) often rely on induction/recursion
in the time parameter, which naturally fails for continuous-time problems.
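To make this difference concrete, the following minimal Python sketch simulates individual transitions of a tabular CTMDP as just described. The arrays `rate` and `trans_prob`, the horizon `T`, and all other names are illustrative placeholders, not objects defined in the paper.

```python
import numpy as np

# Illustrative stand-ins for the (unknown) model parameters:
# rate[x, a] plays the role of lambda(x, a); trans_prob[x, a] of p(. | x, a).
rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
rate = rng.uniform(0.5, 2.0, size=(n_states, n_actions))
trans_prob = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def ctmdp_step(x, a):
    """Sample one transition: an exponential holding time, then a jump."""
    holding_time = rng.exponential(1.0 / rate[x, a])   # Exp(lambda(x, a)) sojourn
    y = rng.choice(n_states, p=trans_prob[x, a])       # jump to y with p(y | x, a)
    return holding_time, y

# Example: simulate a trajectory up to a finite horizon T under a fixed action.
T, t, x = 5.0, 0.0, 0
while t < T:
    dt, x_next = ctmdp_step(x, a=0)
    t, x = t + dt, x_next
```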
In this paper, we develop a learning algorithm for CTMDPs with unknown transition proba-
bilities and holding time rates, based on two key methodological ingredients. First, we build on
the value iteration algorithms of Mamer (1986), Huang and Guo (2011) for solving finite-horizon
CTMDPs when all the model parameters are known. This type of algorithm is based on an
operator with a fixed point corresponding to the solution to the dynamic programming equation.
A major benefit of this approach is that it naturally introduces a (discrete) number of iterations
that can be used for induction arguments in regret analysis. Second, we take the idea of “optimism
under uncertainty”, i.e., one acts as if the environment is as nice as plausibly possible (Lattimore
and Szepesvári 2020). This is a popular paradigm for addressing the exploration–exploitation trade-off in RL. To ensure optimism, that is, to obtain a reasonably good upper bound on the true value function, we carefully design the exploration bonus needed for efficient learning and regret
analysis.
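As a rough illustration of the fixed-point idea, the sketch below iterates a Bellman-type operator for a discounted, infinite-horizon tabular CTMDP with reward rates, obtained by integrating out the exponential holding time. This is a simplified stand-in chosen for brevity, not the finite-horizon operator of Mamer (1986) and Huang and Guo (2011); all function and variable names are ours.

```python
import numpy as np

def bellman_operator(V, reward, rate, trans_prob, alpha):
    """One application of a discounted CTMDP Bellman-type operator.

    For each state x this computes
        max_a (r(x, a) + lambda(x, a) * sum_y p(y | x, a) V(y)) / (alpha + lambda(x, a)),
    which follows by integrating the exponential holding time out of the
    dynamic programming equation; its fixed point is the value function.
    """
    expected_next = trans_prob @ V                       # (S, A): sum_y p(y | x, a) V(y)
    q_values = (reward + rate * expected_next) / (alpha + rate)
    return q_values.max(axis=1)

def value_iteration(reward, rate, trans_prob, alpha, tol=1e-8, max_iter=10_000):
    """Iterate the operator until the sup-norm change falls below `tol`.

    The iteration counter n is the discrete index that can serve as the basis
    for induction arguments, as discussed in the text.
    """
    V = np.zeros(reward.shape[0])
    for n in range(max_iter):
        V_new = bellman_operator(V, reward, rate, trans_prob, alpha)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, n + 1
        V = V_new
    return V, max_iter
```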
Because we are dealing with Markov chains (instead of, say, diffusion processes with continuous
state space), it is only natural and indeed unavoidable that certain components of our analysis
are inspired by those for discrete-time Markov chains, in particular the UCRL2 (Upper Confidence
Reinforcement Learning) algorithm of Jaksch et al. (2010) and the UCBVI (Upper Confidence
Bound Value Iteration) algorithm of Azar et al. (2017), both originally developed for RL in tabular
DTMDPs. At the beginning of each episode, our algorithm estimates the transition probabilities
and the holding time rate using the past data/samples. Based on the level of uncertainty in these
estimates, we carefully design an exploration bonus for each state–action pair. We then update
the policy by applying a (modified) value iteration to the estimated CTMDP with the reward
augmented by the bonus.
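The per-episode structure just described might be sketched as follows. The maximum-likelihood estimates of the transition probabilities and holding time rates are standard, but the Hoeffding-style bonus and the discounted planner (the `value_iteration` sketch above) are generic placeholders rather than the bonus and the modified value iteration actually used in the paper; all names are illustrative.

```python
import numpy as np

def estimate_and_plan(counts, total_holding_time, cum_reward, trans_counts,
                      alpha=0.1, bonus_const=1.0, delta=0.05):
    """Schematic per-episode update: estimate the model, add a bonus, re-plan.

    counts[x, a]             -- number of observed transitions from (x, a)
    total_holding_time[x, a] -- total sojourn time accumulated at (x, a)
    cum_reward[x, a]         -- cumulative reward observed at (x, a)
    trans_counts[x, a, y]    -- number of observed jumps (x, a) -> y
    """
    n = np.maximum(counts, 1)                               # avoid division by zero
    rate_hat = n / np.maximum(total_holding_time, 1e-12)    # estimate of lambda(x, a)
    p_hat = trans_counts / n[:, :, None]                    # empirical p(y | x, a)
    reward_hat = cum_reward / n

    # Generic Hoeffding-style exploration bonus shrinking like 1/sqrt(N(x, a));
    # the bonus designed in the paper for CTMDPs is not reproduced here.
    bonus = bonus_const * np.sqrt(np.log(2.0 / delta) / n)

    # Re-plan on the optimistic model: estimated CTMDP with bonus-augmented reward.
    V_opt, _ = value_iteration(reward_hat + bonus, rate_hat, p_hat, alpha)
    greedy = np.argmax(
        (reward_hat + bonus + rate_hat * (p_hat @ V_opt)) / (alpha + rate_hat),
        axis=1,
    )
    return greedy, V_opt
```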
Our main result, Theorem 1, indicates that the proposed RL algorithm has a worst-case expected regret of Õ(√K), where K is the number of episodes and Õ ignores logarithmic factors and hides