
Opportunistic reinforcement learning.
Motivated by these examples, we propose and study opportunistic episodic reinforcement learning, a new paradigm of reinforcement learning problems in which the regret of executing a suboptimal action depends on a varying cost, referred to as the variation factor, associated with the environmental conditions. When the variation factor is low, so is the cost/regret of picking a suboptimal action, and vice versa. Intuitively, we should therefore explore more when the variation factor is low and exploit more when it is high. As its name suggests, opportunistic RL leverages the dynamics of the variation factor to reduce regret.
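To make this intuition concrete, the following minimal sketch scales a generic exploration bonus by a normalized variation factor, so that exploration is amplified when the factor is low and suppressed when it is high. The names (opportunistic_bonus, base_bonus, l_min, l_max) and the linear scaling rule are illustrative assumptions, not the specification of OppUCRL2.

    import numpy as np

    def opportunistic_bonus(base_bonus, variation_factor, l_min=0.0, l_max=1.0):
        """Scale an exploration bonus by how cheap exploration currently is.

        A low variation factor means suboptimal actions are cheap, so the
        bonus (and hence exploration) is enlarged; a high variation factor
        shrinks it. The linear scaling below is illustrative only.
        """
        # Normalize the variation factor to [0, 1].
        l_norm = (variation_factor - l_min) / max(l_max - l_min, 1e-12)
        l_norm = float(np.clip(l_norm, 0.0, 1.0))
        # Explore more when the variation factor is low, exploit when it is high.
        return base_bonus * (1.0 - l_norm)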
Contributions.
In this work, we propose the OppUCRL2 algorithm for opportunistic learning in episodic RL, which introduces variation-factor awareness to balance the exploration-exploitation trade-off. OppUCRL2 significantly outperforms UCRL2 (Jaksch et al., 2010) in simulation while retaining the same theoretical regret guarantee. The opportunistic RL concept also generalizes easily to other reinforcement learning algorithms. To demonstrate this, we design the OppPSRL algorithm based on PSRL (Ian et al., 2013), which likewise outperforms its original counterpart in simulation. To the best of our knowledge, this is the first work to propose and study the concept of opportunistic reinforcement learning. We believe it will serve as a foundation for opportunistic reinforcement learning and help further address the exploration-exploitation trade-off.
2. Related Work
Optimism in the face of uncertainty (OFU) is a popular paradigm for handling the exploration-exploitation trade-off in RL, where each state-action pair is assigned an optimism bonus. The agent then chooses a policy that is optimal under the resulting "optimistic" model of the environment. To learn efficiently, it maintains control over its uncertainty by assigning a more substantial optimism bonus to potentially informative states and actions; the assigned bonus thus stimulates and guides the exploration process. Most OFU algorithms provide strong theoretical guarantees (Azar et al., 2017; Bartlett & Tewari, 2009; Jaksch et al., 2010; Dann & Brunskill, 2015; Strehl et al., 2009).
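As a rough illustration of the OFU idea, the sketch below adds a count-based bonus that shrinks as a state-action pair is visited more often. The sqrt(log t / n) form is a typical UCB-style choice assumed here for concreteness; it is not the exact bonus of any of the cited algorithms.

    import numpy as np

    def optimistic_values(emp_reward, visit_counts, t, c=1.0):
        """Generic count-based optimism over all state-action pairs.

        emp_reward[s, a]   : empirical mean reward of (s, a)
        visit_counts[s, a] : number of times (s, a) has been tried
        Rarely visited pairs receive a larger bonus, which draws the
        "optimistic" planner toward potentially informative actions.
        """
        bonus = c * np.sqrt(np.log(max(t, 2)) / np.maximum(visit_counts, 1))
        return emp_reward + bonus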
A popular competitor to OFU algorithms is inspired by Thompson sampling (TS) (Chapelle & Li, 2011). In RL, TS approaches (Strens, 2000) maintain a posterior distribution over the reward function and the transition kernel, and then compute the optimal policy for an MDP randomly sampled from the posterior. One of the best-known TS algorithms in the literature is Posterior Sampling for Reinforcement Learning (PSRL) (Ian et al., 2013; Osband & Van Roy, 2017).
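The sketch below illustrates the sampling step such methods build on, assuming a Dirichlet posterior over each row of the transition kernel; the array layout and names are our own illustrative choices, not the cited implementation.

    import numpy as np

    def sample_transition_kernel(dirichlet_counts, rng):
        """Draw one plausible transition kernel from a Dirichlet posterior.

        dirichlet_counts[s, a, s'] holds prior pseudo-counts plus observed
        transition counts. A PSRL-style agent would then solve the sampled
        MDP and follow its optimal policy for the next episode.
        """
        n_states, n_actions, _ = dirichlet_counts.shape
        P_sample = np.empty_like(dirichlet_counts, dtype=float)
        for s in range(n_states):
            for a in range(n_actions):
                P_sample[s, a] = rng.dirichlet(dirichlet_counts[s, a])
        return P_sample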
The opportunistic learning idea has been introduced in (Wu et al., 2018) for classic $K$-armed bandits and in (Guo et al., 2019) for contextual bandits. In reinforcement learning, the authors of (Dann et al., 2019) consider the case where each episode comes with a side context and propose the ORLC algorithm, which uses the context information to estimate the dynamics of the environment but does not involve the opportunistic concept, and thus differs from our setting. To the best of our knowledge, no prior work has provided a formal mathematical formulation and rigorous performance analysis for opportunistic reinforcement learning.
3. Problem Formulation
We consider an RL problem in an episodic finite-horizon Markov decision process (MDP) $M := \langle \mathcal{S}, \mathcal{A}, H, P, r \rangle$, where $\mathcal{S}$ is a finite state space with cardinality $|\mathcal{S}| = S$, $\mathcal{A}$ is a finite action space with cardinality $|\mathcal{A}| = A$, $H$ is the horizon that represents the number of time steps in each episode, $P$ is a state transition distribution such that $P(\cdot \mid s, a)$ dictates a distribution over states in $\mathcal{S}$ if action $a$ is taken in state $s$, and $r : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the deterministic reward function. For simplicity, we assume that the reward function $r$ is known to the agent but the transition distribution $P$ is unknown.
In each episode of this MDP, an initial state $s_1 \in \mathcal{S}$ is chosen arbitrarily by the environment before the episode starts. For each step $h \in [H]$,^1 the agent observes a state $s_h \in \mathcal{S}$, selects an action $a_h \in \mathcal{A}$, receives the reward $r(s_h, a_h)$, and then the state transits to the next state $s_{h+1} \in \mathcal{S}$, drawn from the distribution $P(\cdot \mid s_h, a_h)$. The episode ends in state $s_{H+1}$.
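Under these definitions, one episode of interaction can be sketched as follows; the array representations of $P$, $r$, and the policy, and the use of rng (e.g., np.random.default_rng()), are assumptions made only for this illustration.

    import numpy as np

    def run_episode(P, r, policy, s1, H, rng):
        """Roll out one H-step episode of the finite MDP <S, A, H, P, r>.

        P[s, a]      : transition distribution over next states (length S)
        r[s, a]      : deterministic reward of taking action a in state s
        policy[s, h] : action chosen in state s at step h (h = 0, ..., H-1)
        Returns the total reward collected during the episode.
        """
        s, total = s1, 0.0
        for h in range(H):
            a = policy[s, h]
            total += r[s, a]
            # Draw s_{h+1} from P(.|s_h, a_h); the episode ends in s_{H+1}.
            s = rng.choice(len(P[s, a]), p=P[s, a])
        return total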
A policy for an agent during an episode is expressed as a mapping $\pi : \mathcal{S} \times [H] \to \mathcal{A}$. We write $V^\pi_h : \mathcal{S} \to \mathbb{R}$ for the value function at step $h$ under policy $\pi$. For a state $s \in \mathcal{S}$, $V^\pi_h(s)$ is the expected return (i.e., sum of rewards) received under policy $\pi$ starting from $s = s_h \in \mathcal{S}$, i.e., $V^\pi_h(s) := \mathbb{E}\big[\sum_{i=h}^{H} r(s_i, \pi(s_i, i)) \,\big|\, s_h = s\big]$. Because the action space, state space, and horizon are finite, and the reward function is deterministic, there always exists an optimal policy $\pi^*$ that attains the best value $V^*_h(s) = \sup_\pi V^\pi_h(s)$ for all $s \in \mathcal{S}$ and $h \in [H]$. For an episode with initial state $s_1$, the quality of a policy $\pi$ is measured by its regret, i.e., the gap between the value function at step $1$ under policy $\pi$ and that under the optimal policy, $V^*_1(s_1) - V^\pi_1(s_1)$. In the classic RL problem, an agent interacts with the environment (MDP $M$) for $K$ episodes $k \in [K]$ in a sequential manner, with the goal of finding the optimal policy.
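For concreteness, both value functions, and hence the regret, can be computed by backward induction over the horizon, as in the sketch below; the array shapes follow the same illustrative conventions used above.

    import numpy as np

    def policy_value(P, r, policy, H):
        """V^pi by backward induction: V[H] = 0 and
        V[h, s] = r(s, pi(s, h)) + sum_{s'} P(s'|s, pi(s, h)) * V[h+1, s'].
        Code index h = 0, ..., H-1 corresponds to step h+1 in the text.
        """
        S = r.shape[0]
        V = np.zeros((H + 1, S))
        for h in range(H - 1, -1, -1):
            for s in range(S):
                a = policy[s, h]
                V[h, s] = r[s, a] + P[s, a] @ V[h + 1]
        return V                      # V[0, s1] is V^pi_1(s1)

    def optimal_value(P, r, H):
        """V^* by dynamic programming over the same horizon."""
        V = np.zeros((H + 1, r.shape[0]))
        for h in range(H - 1, -1, -1):
            Q = r + P @ V[h + 1]      # Q[s, a] = r(s, a) + E[V_{h+1}(s')]
            V[h] = Q.max(axis=1)
        return V

    # The regret of pi on an episode with initial state s1 is then
    # optimal_value(P, r, H)[0, s1] - policy_value(P, r, policy, H)[0, s1].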
Next we introduce opportunistic reinforcement learning in an episodic finite-horizon MDP. For each episode $k \in [K]$, let $L_k \ge 0$ be an external variation factor that does not change during the episode. We assume $L_k$ is independent of the MDP $M$ for all $k \in [K]$. To distinguish different
^1 We write $[n]$ for the set $\{i \in \mathbb{N} : 1 \le i \le n\}$.