Opportunistic Episodic Reinforcement Learning

Xiaoxiao Wang, Nader Bouacida, Xueying Guo, Xin Liu
University of California, Davis, CA.
Abstract

In this paper, we propose and study opportunistic reinforcement learning, a new variant of reinforcement learning problems where the regret of selecting a suboptimal action varies under an external environmental condition known as the variation factor. When the variation factor is low, so is the regret of selecting a suboptimal action, and vice versa. Our intuition is to exploit more when the variation factor is high and explore more when the variation factor is low. We demonstrate the benefit of this novel framework for finite-horizon episodic MDPs by designing and evaluating the OppUCRL2 and OppPSRL algorithms. Our algorithms dynamically balance the exploration-exploitation trade-off for reinforcement learning by introducing variation factor-dependent optimism to guide exploration. We establish an $\tilde{O}(HS\sqrt{AT})$ regret bound for the OppUCRL2 algorithm and show through simulations that both OppUCRL2 and OppPSRL outperform their original counterparts.
1. Introduction

Recently, reinforcement learning (RL) has shown spectacular success. By experimenting, computers can learn how to autonomously perform tasks that no programmer could teach them. However, the performance of these approaches significantly depends on the application domain. In general, we need reinforcement learning algorithms with both good empirical performance and strong theoretical guarantees. This goal cannot be achieved without efficient exploration of the environment, which has been studied in episodic RL.
In episodic RL, an agent interacts with the environment in a series of episodes while trying to maximize the total reward accumulated over time (Burnetas & Katehakis, 1997; Sutton & Barto, 1998). This learning process leads to a fundamental trade-off: shall the agent explore insufficiently understood states and actions to gain new knowledge, resulting in better long-term performance, or exploit its existing information to maximize short-run rewards? Existing algorithms focus on how to balance this trade-off appropriately under the implicit assumption that the exploration cost remains the same over time. However, in a variety of application scenarios, the exploration cost is time-varying and situation-dependent. Such scenarios give us an opportunity to explore more when the exploration cost is relatively low and exploit more when that cost is high, thus adaptively balancing the exploration-exploitation trade-off to achieve better performance. Consider the following motivating examples.
Motivating scenario 1: return variation in games. In a game or on a gambling machine, players may attain special multipliers (2×, 4×, etc.) on their reward in some rounds. They can win a large number of points by getting lucky and having large prizes supplemented by large multipliers. Hence, when the player is given a large multiplier, he had better play the move that he believes is the best. Such a conservative move is less risky, especially since we already know that taking a "bad" action in this situation will result in a significant loss. On the other hand, in a game round with no multiplier or a small one, playing an experimental action will be less risky, since the regret of trying a suboptimal move in this case will be lower.
Motivating scenario 2: value variation in sequential recommendations. In a sequential recommender system for e-commerce, the system successively suggests candidate products to users to maximize the total click-through rate (i.e., the probability that a user accepts the recommendation) based on users' preferences and browsing history. We note that the real monetary return of a recommendation (if accepted) can differ depending on other factors, such as users with different levels of purchasing power or loyalty (e.g., diamond vs. silver status). Because the ultimate goal is to maximize the overall monetary reward, intuitively, when the monetary return of a recommendation (if accepted) is low, the monetary regret of suggesting suboptimal products is low, leading to a low exploration cost; correspondingly, high returns lead to high regret and a high exploration cost.
Opportunistic reinforcement learning. Motivated by these examples, we propose and study opportunistic episodic reinforcement learning, a new paradigm of reinforcement learning problems where the regret of executing a suboptimal action depends on a varying cost, referred to as the variation factor, associated with the environmental conditions. When the variation factor is low, so is the cost/regret of picking a suboptimal action, and vice versa. Therefore, intuitively, we should explore more when the variation factor is low and exploit more when the variation factor is high. As its name suggests, in opportunistic RL, we leverage the opportunities offered by the variation factor's dynamics to reduce regret.
Contributions. In this work, we propose the OppUCRL2 algorithm for opportunistic learning in episodic RL, which introduces variation factor-awareness to balance the exploration-exploitation trade-off. OppUCRL2 significantly outperforms UCRL2 (Jaksch et al., 2010) in simulation and has the same theoretical guarantee with respect to the regret. The opportunistic RL concept is also easy to generalize to other reinforcement learning algorithms. To demonstrate this, we design the OppPSRL algorithm based on PSRL (Osband et al., 2013). It also achieves better performance than its original version in simulation. To the best of our knowledge, this is the first work proposing and studying the concept of opportunistic reinforcement learning. We believe this work will serve as a foundation for the opportunistic reinforcement learning concept and help further address the exploration-exploitation trade-off.
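To make the idea of variation factor-dependent optimism concrete, the sketch below shows one plausible way to scale a UCRL2-style confidence bonus by the episode's variation factor. The exact scaling rule used by OppUCRL2 is defined in the algorithm itself; the linear rule, normalization bounds, and names below are illustrative assumptions only.

```python
import numpy as np

def opportunistic_bonus(base_bonus, variation_factor, l_min, l_max):
    """Scale an optimism bonus by the current variation factor (illustrative).

    base_bonus       : standard UCRL2-style confidence width for a (s, a) pair
    variation_factor : external factor L_k observed at the start of episode k
    l_min, l_max     : assumed lower/upper bounds used to normalize L_k

    Hypothetical scaling rule (not the paper's exact definition): the bonus
    is largest when L_k is at its minimum (cheap exploration) and shrinks
    toward zero as L_k approaches its maximum (costly exploration).
    """
    l_norm = np.clip((variation_factor - l_min) / max(l_max - l_min, 1e-12), 0.0, 1.0)
    return (1.0 - l_norm) * base_bonus
```

With such a rule, episodes with a low variation factor keep large optimism bonuses and trigger more exploration, while episodes with a high variation factor act nearly greedily with respect to current estimates.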
2. Related Work

Optimism in the face of uncertainty (OFU) is a popular paradigm for the exploration-exploitation trade-off in RL, where each pair of states and actions is offered some optimism bonus. The agent then chooses a policy that is optimal under the "optimistic" model of the environment. To learn efficiently, it maintains some control over its uncertainty by assigning a more substantial optimism bonus to potentially informative states and actions. The assigned bonus can stimulate and guide the exploration process. Most OFU algorithms provide strong theoretical guarantees (Azar et al., 2017; Bartlett & Tewari, 2009; Jaksch et al., 2010; Dann & Brunskill, 2015; Strehl et al., 2009). A popular competitor to OFU algorithms is inspired by Thompson sampling (TS) (Chapelle & Li, 2011). In RL, TS approaches (Strens, 2000) maintain a posterior distribution over the reward function and the transition kernel, then compute the optimal policy for an MDP randomly sampled from the posterior. One of the well-known TS algorithms in the literature is Posterior Sampling for Reinforcement Learning (PSRL) (Osband et al., 2013; Osband & Van Roy, 2017).
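As a concrete illustration of the posterior sampling step just described, the following sketch draws a transition kernel from a Dirichlet posterior given visit counts; the uniform Dirichlet prior and the function's name and signature are assumptions for illustration, not PSRL's exact specification.

```python
import numpy as np

def sample_mdp_from_posterior(counts):
    """Sample a transition kernel from a Dirichlet posterior (illustrative).

    counts : array of shape (S, A, S) holding visit counts N(s, a, s'),
             assumed to be maintained by the agent across episodes.

    With a Dirichlet(1, ..., 1) prior on each row P(.|s, a), the posterior is
    Dirichlet(1 + counts[s, a, :]); a PSRL-style agent draws one kernel per
    episode and acts greedily with respect to the optimal policy of the sample.
    """
    S, A, _ = counts.shape
    P = np.zeros_like(counts, dtype=float)
    for s in range(S):
        for a in range(A):
            P[s, a] = np.random.dirichlet(1.0 + counts[s, a])
    return P  # plug into a finite-horizon planner to get the episode's policy
```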
The opportunistic learning idea has been introduced in (Wu et al., 2018) for classic K-armed bandits and in (Guo et al., 2019) for contextual bandits. In reinforcement learning, the authors of (Dann et al., 2019) consider the case where each episode has side context and propose the ORLC algorithm, which uses the context information to estimate the dynamics of the environment; however, it does not include the opportunistic concept, which distinguishes it from our work. To the best of our knowledge, no prior work has provided a formal mathematical formulation and rigorous performance analysis for opportunistic reinforcement learning.
3. Problem Formulation

We consider an RL problem in an episodic finite-horizon Markov decision process (MDP), $M := \langle \mathcal{S}, \mathcal{A}, H, P, r \rangle$, where $\mathcal{S}$ is a finite state space with cardinality $|\mathcal{S}| = S$, $\mathcal{A}$ is a finite action space with cardinality $|\mathcal{A}| = A$, $H$ is the horizon that represents the number of time steps in each episode, $P$ is a state transition distribution such that $P(\cdot \mid s, a)$ dictates a distribution over states in $\mathcal{S}$ if action $a$ is taken in state $s$, and $r: \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the deterministic reward function. For simplicity, we assume that the reward function $r$ is known to the agent but the transition distribution $P$ is unknown.
In each episode of this MDP, an initial state $s_1 \in \mathcal{S}$ is chosen arbitrarily by the environment before the episode starts. For each step $h \in [H]$¹, the agent observes a state $s_h \in \mathcal{S}$, selects an action $a_h \in \mathcal{A}$, receives a reward $r(s_h, a_h)$, and then the state transits to the next state $s_{h+1} \in \mathcal{S}$, which is drawn from the distribution $P(\cdot \mid s_h, a_h)$. The episode ends in state $s_{H+1}$.
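For concreteness, the following sketch rolls out one episode under this protocol (known reward, unknown transitions sampled by the environment). The array representations of $P$, $r$, and the policy, as well as the function name, are illustrative assumptions rather than part of the formulation.

```python
import numpy as np

def run_episode(P, r, policy, s1, H, rng=np.random.default_rng()):
    """Roll out one episode of the finite-horizon MDP described above.

    P      : true transition kernel, shape (S, A, S); unknown to the agent
    r      : known deterministic reward function, shape (S, A), values in [0, 1]
    policy : array of shape (H, S) mapping (step h, state s) -> action
    s1     : initial state chosen arbitrarily by the environment
    H      : horizon (number of steps per episode)
    """
    s, total_reward, trajectory = s1, 0.0, []
    for h in range(H):
        a = policy[h, s]
        total_reward += r[s, a]
        s_next = rng.choice(P.shape[2], p=P[s, a])  # draw s_{h+1} ~ P(.|s_h, a_h)
        trajectory.append((s, a, s_next))
        s = s_next                                  # episode ends in state s_{H+1}
    return total_reward, trajectory
```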
A policy for an agent during the episode is expressed as a mapping $\pi: \mathcal{S} \times [H] \to \mathcal{A}$. We write $V^\pi_h: \mathcal{S} \to \mathbb{R}$ for the value function at step $h$ under policy $\pi$. For a state $s \in \mathcal{S}$, $V^\pi_h(s)$ is the expected return (i.e., sum of rewards) received under policy $\pi$, starting from $s = s_h \in \mathcal{S}$, i.e., $V^\pi_h(s) := \mathbb{E}\left[ \sum_{i=h}^{H} r(s_i, \pi(s_i, i)) \mid s_h = s \right]$. Because the action space, state space, and horizon are finite, and the reward function is deterministic, there always exists an optimal policy $\pi^*$ that attains the best value $V^*_h(s) = \sup_{\pi} V^\pi_h(s)$ for all $s \in \mathcal{S}$ and $h \in [H]$. For an episode with initial state $s_1$, the quality of a policy $\pi$ is measured by its regret, which is the gap between the value function at step $1$ under policy $\pi$ and that under the optimal policy, i.e., $V^*_1(s_1) - V^\pi_1(s_1)$. In the classic RL problem, an RL agent interacts with the environment (MDP $M$) for $K$ episodes $k \in [K]$ in a sequential manner, with the goal of finding the optimal policy.
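Since the state space, action space, and horizon are all finite, $V^*_h$ (and, analogously, $V^\pi_h$) can be computed by backward induction over $h = H, \dots, 1$. The sketch below does this under the same array conventions assumed above; it is a minimal illustration rather than part of the algorithms studied in this paper.

```python
import numpy as np

def optimal_values(P, r, H):
    """Compute V*_h(s) for a finite-horizon MDP by backward induction.

    P : transition kernel, shape (S, A, S);  r : reward, shape (S, A)
    Returns V of shape (H + 1, S) with V[H] = 0, and the greedy optimal policy.
    """
    S, A, _ = P.shape
    V = np.zeros((H + 1, S))
    policy = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):    # steps H, ..., 1 (0-indexed here)
        Q = r + P @ V[h + 1]          # Q[s, a] = r(s, a) + E[V_{h+1}(s')]
        policy[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return V, policy
```

Given such a planner, the per-episode regret of a policy $\pi$ is the difference between `V[0, s1]` computed for the optimal policy and the corresponding value under $\pi$, matching $V^*_1(s_1) - V^\pi_1(s_1)$ above.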
Next, we introduce opportunistic reinforcement learning in an episodic finite-horizon MDP. For each episode $k \in [K]$, let $L_k \geq 0$ be an external variation factor that does not change during the episode. We assume $L_k$ is independent of the MDP $M$ for $k \in [K]$. To distinguish different
¹We write $[n]$ for $\{ i \in \mathbb{N} : 1 \leq i \leq n \}$.