
Opportunistic reinforcement learning.
Motivated by these examples, we propose and study opportunistic episodic reinforcement learning, a new paradigm of reinforcement learning problems in which the regret of executing a suboptimal action depends on a varying cost, referred to as the variation factor, associated with the environmental conditions. When the variation factor is low, so is the cost/regret of picking a suboptimal action, and vice versa. Intuitively, we should therefore explore more when the variation factor is low and exploit more when it is high. As its name suggests, opportunistic RL leverages the dynamics of the variation factor to reduce regret.
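To make this intuition concrete, the following minimal sketch scales a generic exploration bonus by a normalized variation factor, so that exploration is amplified when the factor is low and suppressed when it is high. The names (opportunistic_bonus, base_bonus, l_min, l_max) and the linear scaling rule are illustrative assumptions, not the specification of OppUCRL2.

    import numpy as np

    def opportunistic_bonus(base_bonus, variation_factor, l_min=0.0, l_max=1.0):
        """Scale an exploration bonus by how cheap exploration currently is.

        A low variation factor means suboptimal actions are cheap, so the
        bonus (and hence exploration) is enlarged; a high variation factor
        shrinks it. The linear scaling below is illustrative only.
        """
        # Normalize the variation factor to [0, 1].
        l_norm = (variation_factor - l_min) / max(l_max - l_min, 1e-12)
        l_norm = float(np.clip(l_norm, 0.0, 1.0))
        # Explore more when the variation factor is low, exploit when it is high.
        return base_bonus * (1.0 - l_norm)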
Contributions.
In this work, we propose the OppUCRL2 algorithm for opportunistic learning in episodic RL, which introduces variation-factor awareness to balance the exploration-exploitation trade-off. OppUCRL2 significantly outperforms UCRL2 (Jaksch et al., 2010) in simulation while retaining the same theoretical regret guarantee. The opportunistic RL concept also generalizes easily to other reinforcement learning algorithms. To demonstrate this, we design the OppPSRL algorithm based on PSRL (Ian et al., 2013), which likewise outperforms its original counterpart in simulation. To the best of our knowledge, this is the first work to propose and study the concept of opportunistic reinforcement learning. We believe it will serve as a foundation for opportunistic reinforcement learning and help further address the exploration-exploitation trade-off.
2. Related Work
Optimism in the face of uncertainty (OFU) is a popular paradigm for handling the exploration-exploitation trade-off in RL, where each state-action pair is assigned an optimism bonus. The agent then chooses a policy that is optimal under the resulting "optimistic" model of the environment. To learn efficiently, it maintains control over its uncertainty by assigning a more substantial optimism bonus to potentially informative states and actions; the assigned bonus thus stimulates and guides the exploration process. Most OFU algorithms provide strong theoretical guarantees (Azar et al., 2017; Bartlett & Tewari, 2009; Jaksch et al., 2010; Dann & Brunskill, 2015; Strehl et al., 2009).
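As a rough illustration of the OFU idea, the sketch below adds a count-based bonus that shrinks as a state-action pair is visited more often. The sqrt(log t / n) form is a typical UCB-style choice assumed here for concreteness; it is not the exact bonus of any of the cited algorithms.

    import numpy as np

    def optimistic_values(emp_reward, visit_counts, t, c=1.0):
        """Generic count-based optimism over all state-action pairs.

        emp_reward[s, a]   : empirical mean reward of (s, a)
        visit_counts[s, a] : number of times (s, a) has been tried
        Rarely visited pairs receive a larger bonus, which draws the
        "optimistic" planner toward potentially informative actions.
        """
        bonus = c * np.sqrt(np.log(max(t, 2)) / np.maximum(visit_counts, 1))
        return emp_reward + bonus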
A popular competitor to OFU algorithms is inspired by Thompson sampling (TS) (Chapelle & Li, 2011). In RL, TS approaches (Strens, 2000) maintain a posterior distribution over the reward function and the transition kernel, and then compute the optimal policy for an MDP randomly sampled from the posterior. One of the best-known TS algorithms in the literature is Posterior Sampling for Reinforcement Learning (PSRL) (Ian et al., 2013; Osband & Van Roy, 2017).
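The sketch below illustrates the sampling step such methods build on, assuming a Dirichlet posterior over each row of the transition kernel; the array layout and names are our own illustrative choices, not the cited implementation.

    import numpy as np

    def sample_transition_kernel(dirichlet_counts, rng):
        """Draw one plausible transition kernel from a Dirichlet posterior.

        dirichlet_counts[s, a, s'] holds prior pseudo-counts plus observed
        transition counts. A PSRL-style agent would then solve the sampled
        MDP and follow its optimal policy for the next episode.
        """
        n_states, n_actions, _ = dirichlet_counts.shape
        P_sample = np.empty_like(dirichlet_counts, dtype=float)
        for s in range(n_states):
            for a in range(n_actions):
                P_sample[s, a] = rng.dirichlet(dirichlet_counts[s, a])
        return P_sample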
The opportunistic learning idea has been introduced in (Wu et al., 2018) for classic $K$-armed bandits and in (Guo et al., 2019) for contextual bandits. In reinforcement learning, the authors of (Dann et al., 2019) consider the case where each episode comes with a side context and propose the ORLC algorithm, which uses the context information to estimate the dynamics of the environment but does not involve the opportunistic concept, and thus differs from our setting. To the best of our knowledge, no prior work has provided a formal mathematical formulation and rigorous performance analysis for opportunistic reinforcement learning.
3. Problem Formulation
We consider an RL problem in an episodic finite-horizon Markov decision process (MDP) $M := \langle \mathcal{S}, \mathcal{A}, H, P, r \rangle$, where $\mathcal{S}$ is a finite state space with cardinality $|\mathcal{S}| = S$, $\mathcal{A}$ is a finite action space with cardinality $|\mathcal{A}| = A$, $H$ is the horizon that represents the number of time steps in each episode, $P$ is a state transition distribution such that $P(\cdot \mid s, a)$ dictates a distribution over states in $\mathcal{S}$ if action $a$ is taken in state $s$, and $r : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the deterministic reward function. For simplicity, we assume that the reward function $r$ is known to the agent but the transition distribution $P$ is unknown.
In each episode of this MDP, an initial state $s_1 \in \mathcal{S}$ is chosen arbitrarily by the environment before the episode starts. For each step $h \in [H]$,^1 the agent observes a state $s_h \in \mathcal{S}$, selects an action $a_h \in \mathcal{A}$, receives the reward $r(s_h, a_h)$, and then the state transits to the next state $s_{h+1} \in \mathcal{S}$, drawn from the distribution $P(\cdot \mid s_h, a_h)$. The episode ends in state $s_{H+1}$.
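Under these definitions, one episode of interaction can be sketched as follows; the array representations of $P$, $r$, and the policy, and the use of rng (e.g., np.random.default_rng()), are assumptions made only for this illustration.

    import numpy as np

    def run_episode(P, r, policy, s1, H, rng):
        """Roll out one H-step episode of the finite MDP <S, A, H, P, r>.

        P[s, a]      : transition distribution over next states (length S)
        r[s, a]      : deterministic reward of taking action a in state s
        policy[s, h] : action chosen in state s at step h (h = 0, ..., H-1)
        Returns the total reward collected during the episode.
        """
        s, total = s1, 0.0
        for h in range(H):
            a = policy[s, h]
            total += r[s, a]
            # Draw s_{h+1} from P(.|s_h, a_h); the episode ends in s_{H+1}.
            s = rng.choice(len(P[s, a]), p=P[s, a])
        return total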
A policy for an agent during an episode is expressed as a mapping $\pi : \mathcal{S} \times [H] \to \mathcal{A}$. We write $V^\pi_h : \mathcal{S} \to \mathbb{R}$ for the value function at step $h$ under policy $\pi$. For a state $s \in \mathcal{S}$, $V^\pi_h(s)$ is the expected return (i.e., sum of rewards) received under policy $\pi$ starting from $s = s_h \in \mathcal{S}$, i.e., $V^\pi_h(s) := \mathbb{E}\big[\sum_{i=h}^{H} r(s_i, \pi(s_i, i)) \,\big|\, s_h = s\big]$. Because the action space, state space, and horizon are finite, and the reward function is deterministic, there always exists an optimal policy $\pi^*$ that attains the best value $V^*_h(s) = \sup_\pi V^\pi_h(s)$ for all $s \in \mathcal{S}$ and $h \in [H]$. For an episode with initial state $s_1$, the quality of a policy $\pi$ is measured by its regret, i.e., the gap between the value function at step $1$ under policy $\pi$ and that under the optimal policy, $V^*_1(s_1) - V^\pi_1(s_1)$. In the classic RL problem, an agent interacts with the environment (MDP $M$) for $K$ episodes $k \in [K]$ in a sequential manner, with the goal of finding the optimal policy.
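For concreteness, both value functions, and hence the regret, can be computed by backward induction over the horizon, as in the sketch below; the array shapes follow the same illustrative conventions used above.

    import numpy as np

    def policy_value(P, r, policy, H):
        """V^pi by backward induction: V[H] = 0 and
        V[h, s] = r(s, pi(s, h)) + sum_{s'} P(s'|s, pi(s, h)) * V[h+1, s'].
        Code index h = 0, ..., H-1 corresponds to step h+1 in the text.
        """
        S = r.shape[0]
        V = np.zeros((H + 1, S))
        for h in range(H - 1, -1, -1):
            for s in range(S):
                a = policy[s, h]
                V[h, s] = r[s, a] + P[s, a] @ V[h + 1]
        return V                      # V[0, s1] is V^pi_1(s1)

    def optimal_value(P, r, H):
        """V^* by dynamic programming over the same horizon."""
        V = np.zeros((H + 1, r.shape[0]))
        for h in range(H - 1, -1, -1):
            Q = r + P @ V[h + 1]      # Q[s, a] = r(s, a) + E[V_{h+1}(s')]
            V[h] = Q.max(axis=1)
        return V

    # The regret of pi on an episode with initial state s1 is then
    # optimal_value(P, r, H)[0, s1] - policy_value(P, r, policy, H)[0, s1].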
Next we introduce opportunistic reinforcement learning in an episodic finite-horizon MDP. For each episode $k \in [K]$, let $L_k \ge 0$ be an external variation factor that does not change during the episode. We assume $L_k$ is independent of the MDP $M$ for all $k \in [K]$. To distinguish different
^1 We write $[n]$ for the set $\{i \in \mathbb{N} : 1 \le i \le n\}$.