The Role of Coverage in Online Reinforcement Learning
Tengyang Xie∗
tx10@illinois.edu
Dylan J. Foster∗
dylanfoster@microsoft.com
Yu Bai
yu.bai@salesforce.com
Nan Jiang
nanjiang@illinois.edu
Sham M. Kakade
sham@seas.harvard.edu

∗Equal contribution.
Abstract
Coverage conditions—which assert that the data logging distribution adequately covers the state
space—play a fundamental role in determining the sample complexity of offline reinforcement learning.
While such conditions might seem irrelevant to online reinforcement learning at first glance, we establish
a new connection by showing—somewhat surprisingly—that the mere existence of a data distribution
with good coverage can enable sample-efficient online RL. Concretely, we show that coverability—that is,
existence of a data distribution that satisfies a ubiquitous coverage condition called concentrability—can
be viewed as a structural property of the underlying MDP, and can be exploited by standard algorithms
for sample-efficient exploration, even when the agent does not know said distribution. We complement
this result by proving that several weaker notions of coverage, despite being sufficient for offline RL,
are insufficient for online RL. We also show that existing complexity measures for online RL, including
Bellman rank and Bellman-Eluder dimension, fail to optimally capture coverability, and propose a new
complexity measure, the sequential extrapolation coefficient, to provide a unification.
1 Introduction
The last decade has seen the development of reinforcement learning algorithms with strong empirical performance
in domains including robotics (Kober et al.,2013;Lillicrap et al.,2015), dialogue systems (Li et al.,2016), and
personalization (Agarwal et al.,2016;Tewari and Murphy,2017). While there is great interest in applying
these techniques to real-world decision making applications, the number of samples (steps of interaction)
required to do so is often prohibitive, with state-of-the-art algorithms requiring millions of samples to reach
human-level performance in challenging domains. Developing algorithms with improved sample efficiency,
which entails efficiently generalizing across high-dimensional states and actions while taking advantage of
problem structure as modeled by practitioners, remains a major challenge.
Investigation into design and analysis of algorithms for sample-efficient reinforcement learning has largely
focused on two distinct problem formulations:
• Online reinforcement learning, where the learner can repeatedly interact with the environment by
executing a policy and observing the resulting trajectory.
• Offline reinforcement learning, where the learner has access to logged transitions and rewards gathered
from a fixed behavioral policy (e.g., historical data or expert demonstrations), but cannot directly
interact with the underlying environment.
While these formulations share a common goal (learning a near-optimal policy), the algorithms used to
achieve this goal and the conditions under which it can be achieved are seemingly quite different. Focusing on
value function approximation, sample-efficient algorithms for online reinforcement learning require both (a)
representation conditions, which assert that the function approximator is flexible enough to represent value
functions for the underlying MDP (optimal or otherwise), and (b) exploration conditions (or, structural
conditions) which limit the amount of exploration required to learn a near-optimal policy—typically by
enabling extrapolation across states or limiting the number of effective state distributions (Russo and Van Roy,
2013;Jiang et al.,2017;Sun et al.,2019;Wang et al.,2020c;Du et al.,2021;Jin et al.,2021a;Foster et al.,
2021). Algorithms for offline reinforcement learning typically require similar representation conditions.
However, since data is collected passively from a fixed logging policy/distribution rather than actively, the
exploration conditions used in online RL are replaced with coverage conditions, which assert that the data
collection distribution provides sufficient coverage over the state space (Antos et al.,2008;Chen and Jiang,
2019;Xie and Jiang,2020,2021;Jin et al.,2021b;Rashidinejad et al.,2021;Foster et al.,2022;Zhan et al.,
2022). The aim for both lines of research (online and offline) is to identify the weakest possible conditions
under which learning is possible, and design algorithms that take advantage of these conditions. The two
lines have largely evolved in parallel, and it is natural to wonder whether there are deeper connections. Since
the conditions for sample-efficient online RL and offline RL mainly differ via exploration versus coverage, this
leads us to ask:
If an MDP admits a data distribution with favorable coverage for offline RL, what does this imply
about our ability to perform online RL efficiently?
Beyond intrinsic theoretical value, this question is motivated by the observation that many real-world
applications lie on a spectrum between online and offline. It is common for the learner to have access to
logged/offline data, yet also have the ability to actively interact with the underlying environment, possibly
subject to limitations such as an exploration budget (Kalashnikov et al.,2018). Building a theory of real-world
RL that can lead to algorithm design insights for such settings requires understanding the interplay between
online and offline RL.
1.1 Our Results
We investigate connections between coverage conditions in offline reinforcement learning and exploration
in online reinforcement learning by focusing on the concentrability coefficient, the most ubiquitous notion
of coverage in offline RL. Concentrability quantifies the extent to which the data collection distribution
uniformly covers the state-action distribution induced by any policy. We introduce a new structural property,
coverability, which reflects the best concentrability coefficient that can be achieved by any data distribution,
possibly designed by an oracle with knowledge of the underlying MDP.
1. We show (Section 3) that coverability (that is, mere existence of a distribution with good concentrability)
is sufficient for sample-efficient online exploration, even when the learner has no prior knowledge of this
distribution. This result requires no additional assumptions on the underlying MDP beyond standard
Bellman completeness, and—perhaps surprisingly—is achieved using standard algorithms (Jin et al.,
2021a), albeit with analysis ideas that go beyond existing techniques.
2. We show (Section 4) that several weaker notions of coverage in offline RL, including single-policy concentrability (Jin et al., 2021b; Rashidinejad et al., 2021) and conditions based on Bellman residuals (Chen and Jiang, 2019; Xie et al., 2021a; Cheng et al., 2022), are insufficient for sample-efficient online exploration. This shows that in general, coverage in offline reinforcement learning and exploration in online RL are not compatible, and highlights the need for additional investigation going forward.
Our results serve as a starting point for systematic study of connections between online and offline learnability
in RL. To this end, we provide several secondary results:
1. We show (Section 5) that existing complexity measures for online RL, including Bellman rank and
Bellman-Eluder dimension, do not optimally capture coverability, and provide a new complexity measure,
the sequential extrapolation coefficient, which unifies these notions.
2. We establish (Section 3.3) connections between coverability and reinforcement learning with exogenous
noise, with applications to learning in exogenous block MDPs (Efroni et al.,2021,2022a).
3. We give algorithms for reward-free exploration (Jin et al., 2020a; Chen et al., 2022) under coverability
(Appendix D).
While our results primarily concern analysis of existing algorithms rather than algorithm design, they highlight
a number of exciting directions for future research, and we are optimistic that the notion of coverability can
guide the design of practical algorithms going forward.
Notation.
For an integer n ∈ ℕ, we let [n] denote the set {1, . . . , n}. For a set X, we let ∆(X) denote the set of all probability distributions over X. We adopt non-asymptotic big-oh notation: For functions f, g : X → ℝ₊, we write f = O(g) (resp. f = Ω(g)) if there exists a constant C > 0 such that f(x) ≤ C·g(x) (resp. f(x) ≥ C·g(x)) for all x ∈ X. We write f = Õ(g) if f = O(g · polylog(T)), f = Ω̃(g) if f = Ω(g/polylog(T)), and f = Θ̃(g) if f = Õ(g) and f = Ω̃(g). We write f ≍ g if f = Θ̃(g).
2 Background: Reinforcement Learning, Coverage, and Coverability
We begin by formally introducing the online and offline reinforcement learning problems, then review the
concept of coverage in offline reinforcement learning, focusing on concentrability. Based on this notion, we
introduce coverability as a structural property.
Markov decision processes.
We consider an episodic reinforcement learning setting. Formally, a Markov decision process M = (X, A, P, R, H, x₁) consists of a (potentially large) state space X, action space A, horizon H, probability transition function P = {P_h}_{h=1}^H, where P_h : X × A → ∆(X), reward function R = {R_h}_{h=1}^H, where R_h : X × A → [0, 1], and deterministic initial state x₁ ∈ X.¹ A (randomized) policy is a sequence of per-timestep functions π = {π_h : X → ∆(A)}_{h=1}^H. The policy induces a distribution over trajectories (x₁, a₁, r₁), . . . , (x_H, a_H, r_H) via the following process. For h = 1, . . . , H: a_h ∼ π_h(· | x_h), r_h = R_h(x_h, a_h), and x_{h+1} ∼ P_h(· | x_h, a_h). For notational convenience, we use x_{H+1} to denote a deterministic terminal state with zero reward. We let E^π[·] and P^π[·] denote expectation and probability under this process, respectively. The expected reward for policy π is given by J(π) := E^π[∑_{h=1}^H r_h], and the value function and Q-function for π are given by

    V^π_h(x) := E^π[∑_{h'=h}^H r_{h'} | x_h = x],   and   Q^π_h(x, a) := E^π[∑_{h'=h}^H r_{h'} | x_h = x, a_h = a].

We let π⋆ = {π⋆_h}_{h=1}^H denote the optimal (deterministic) policy, which satisfies the Bellman equation and maximizes Q^π_h(x, a) for all (x, a) ∈ X × A simultaneously; we define V⋆_h = V^{π⋆}_h and Q⋆_h = Q^{π⋆}_h. We define the occupancy measure for policy π via d^π_h(x, a) := P^π[x_h = x, a_h = a] and d^π_h(x) := P^π[x_h = x]. We let T_h denote the Bellman operator for layer h, defined via [T_h f](x, a) = R_h(x, a) + E_{x' ∼ P_h(·|x,a)}[max_{a'} f(x', a')] for f : X × A → ℝ.
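To make these definitions concrete, the following sketch simulates episodes of a small tabular MDP under a fixed randomized policy and estimates the occupancy measures d^π_h(x, a) by Monte Carlo. It is purely illustrative: the dense-array encodings of P, R, and π, and the helper names, are our own assumptions rather than part of the formal setup above, and layers are 0-indexed in the code.

```python
import numpy as np

def rollout(P, R, pi, H, rng, x1=0):
    """Simulate one episode. P[h]: (S, A, S) transition probabilities, R[h]: (S, A) rewards,
    pi[h]: (S, A) action probabilities. Returns the trajectory [(x_h, a_h, r_h)] for h = 1..H."""
    traj, x = [], x1
    for h in range(H):
        a = rng.choice(pi[h].shape[1], p=pi[h][x])        # a_h ~ pi_h(. | x_h)
        r = R[h][x, a]                                    # r_h = R_h(x_h, a_h)
        x_next = rng.choice(P[h].shape[2], p=P[h][x, a])  # x_{h+1} ~ P_h(. | x_h, a_h)
        traj.append((x, a, r))
        x = x_next
    return traj

def estimate_occupancy(P, R, pi, H, n_episodes=10_000, seed=0):
    """Monte Carlo estimate of d^pi_h(x, a) = P^pi[x_h = x, a_h = a]; entry [h-1, x, a]
    of the returned array estimates the occupancy at layer h."""
    rng = np.random.default_rng(seed)
    S, A = R[0].shape
    d = np.zeros((H, S, A))
    for _ in range(n_episodes):
        for h, (x, a, _) in enumerate(rollout(P, R, pi, H, rng)):
            d[h, x, a] += 1.0
    return d / n_episodes
```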
Assumptions.
We assume that rewards are normalized such that ∑_{h=1}^H r_h ∈ [0, 1] (Jiang and Agarwal, 2018; Wang et al., 2020a; Zhang et al., 2021; Jin et al., 2021a). To simplify technical presentation, we assume that X and A are countable; we anticipate that this assumption can be removed.
2.1 Online Reinforcement Learning
Our main results concern online reinforcement learning in an episodic framework, where the learner repeatedly
interacts with an unknown MDP by executing a policy and observing the resulting trajectory, with the goal
of maximizing total reward.
Formally, the protocol proceeds in T rounds, where at each round t = 1, . . . , T, the learner:

• Selects a policy π^(t) = {π^(t)_h}_{h=1}^H to execute in the (unknown) underlying MDP M⋆.

• Observes the resulting trajectory (x^(t)_1, a^(t)_1, r^(t)_1), . . . , (x^(t)_H, a^(t)_H, r^(t)_H).

The learner’s goal is to minimize their cumulative regret, defined via

    Reg := ∑_{t=1}^T [J(π⋆) − J(π^(t))].
¹ While our results assume that the initial state is fixed for simplicity, this assumption is straightforward to relax.
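As a minimal illustration of this protocol, the loop below runs a generic learner against a simulator and tracks cumulative regret. The interfaces `learner`, `env`, and the optimal value `J_star` are hypothetical and introduced only for this sketch; note that each episode's return is a noisy estimate of J(π^(t)), so the quantity tracked matches the regret only in expectation.

```python
def run_online_rl(learner, env, T, J_star):
    """Run the episodic online RL protocol for T rounds and return an estimate of the regret.
    Assumed interfaces: learner.select_policy() returns pi^(t); env.run_episode(pi) executes pi
    in the underlying MDP and returns [(x_h, a_h, r_h)] for h = 1, ..., H; learner.update(traj)
    incorporates the observed trajectory. J_star = J(pi_star) is used only to measure regret."""
    regret_estimate = 0.0
    for t in range(T):
        pi_t = learner.select_policy()              # select pi^(t)
        traj = env.run_episode(pi_t)                # observe the resulting trajectory
        learner.update(traj)
        episode_return = sum(r for (_, _, r) in traj)
        regret_estimate += J_star - episode_return  # equals J(pi_star) - J(pi^(t)) in expectation
    return regret_estimate
```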
To achieve sample-efficient online reinforcement learning guarantees that do not depend on the size of the state space, one typically appeals to value function approximation methods that take advantage of a function class F ⊂ (X × A → ℝ) that attempts to model the value functions for the underlying MDP M⋆ (optimal or otherwise). An active line of research provides structural conditions under which such approaches succeed (Russo and Van Roy, 2013; Jiang et al., 2017; Sun et al., 2019; Wang et al., 2020c; Du et al., 2021; Jin et al., 2021a; Foster et al., 2021), based on assumptions that control the interplay between the function approximator F and the dynamics of the MDP M⋆. These results require (i) representation conditions, which require that F is flexible enough to model value functions of interest (e.g., Q⋆ ∈ F or T_h F_{h+1} ⊆ F_h), and (ii) exploration conditions, which either explicitly or implicitly limit the amount of exploration required for a deliberate algorithm to learn a near-optimal policy. This is typically accomplished by either enabling extrapolation from states already visited, or by limiting the number of effective state distributions that can be encountered.
2.2 Offline Reinforcement Learning and Coverage Conditions
Our aim is to investigate parallels between online and offline reinforcement learning. In offline reinforcement learning, the learner cannot actively execute policies in the underlying MDP M⋆. Instead, for each layer h, they receive a dataset D_h of n tuples (x_h, a_h, r_h, x_{h+1}) with r_h = R_h(x_h, a_h), x_{h+1} ∼ P_h(· | x_h, a_h), and (x_h, a_h) ∼ µ_h i.i.d., where µ_h ∈ ∆(X × A) is the data collection distribution; we define µ = {µ_h}_{h=1}^H. The goal of the learner is to use this data to learn an ε-optimal policy π̂, that is:

    J(π⋆) − J(π̂) ≤ ε.
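For concreteness, the following sketch generates an offline dataset of this form in a tabular MDP; the dense-array encodings of µ_h, P_h, and R_h are our own assumptions for illustration.

```python
import numpy as np

def collect_offline_data(mu, P, R, n, seed=0):
    """mu[h]: (S, A) data distribution over state-action pairs at layer h; P[h]: (S, A, S)
    transitions; R[h]: (S, A) rewards. Returns, for each layer, n i.i.d. tuples
    (x_h, a_h, r_h, x_{h+1}) with (x_h, a_h) ~ mu_h and x_{h+1} ~ P_h(. | x_h, a_h)."""
    rng = np.random.default_rng(seed)
    H, (S, A) = len(mu), mu[0].shape
    datasets = []
    for h in range(H):
        flat = rng.choice(S * A, size=n, p=mu[h].reshape(-1))  # sample (x_h, a_h) ~ mu_h
        xs, acts = np.unravel_index(flat, (S, A))
        rows = [(int(x), int(a), float(R[h][x, a]),
                 int(rng.choice(S, p=P[h][x, a])))             # x_{h+1} ~ P_h(. | x_h, a_h)
                for x, a in zip(xs, acts)]
        datasets.append(rows)
    return datasets
```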
Algorithms for offline reinforcement learning require representation conditions similar to those required for
online RL. However, since it is not possible to actively explore the underlying MDP, one dispenses with
exploration conditions and instead considers coverage conditions, which require that each data distribution
µ_h sufficiently covers the state space.
As an example, consider Fitted Q-Iteration (FQI), one of the most well-studied offline reinforcement learning algorithms (Munos, 2007; Munos and Szepesvári, 2008; Chen and Jiang, 2019). The algorithm, which uses least-squares to approximate Bellman backups, is known to succeed under (i) a representation condition known as Bellman completeness (or “completeness”), which requires that T_h f ∈ F_h for all f ∈ F_{h+1}, and (ii) a coverage condition called concentrability.
Definition 1 (Concentrability). The concentrability coefficient for a data distribution µ = {µ_h}_{h=1}^H and policy class Π is given by

    C_conc(µ) := sup_{π∈Π, h∈[H]} ‖d^π_h / µ_h‖_∞.
Concentrability requires that the data distribution uniformly covers all possible induced state distributions. With concentrability² and completeness, FQI can learn an ε-optimal policy using poly(C_conc(µ), log|F|, H, ε⁻¹) samples. Importantly, this result scales only with the concentrability coefficient C_conc(µ) and the capacity log|F| for the function class, and has no explicit dependence on the size of the state space. There is a vast literature which provides algorithms with similar, often more refined guarantees (Chen and Jiang, 2019; Xie and Jiang, 2020, 2021; Jin et al., 2021b; Rashidinejad et al., 2021; Foster et al., 2022; Zhan et al., 2022).
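For reference, a schematic version of the FQI update is sketched below. It is not the exact algorithm analyzed in the works cited above: the regression oracle `fit_regression`, the dataset layout, and the finite action set are our own simplifying assumptions, but the backward least-squares approximation of the Bellman backup T_h f_{h+1} is the core idea.

```python
def fitted_q_iteration(datasets, fit_regression, H, num_actions):
    """Schematic FQI. datasets[h]: list of offline tuples (x_h, a_h, r_h, x_next) drawn from
    mu_h, indexed h = 1, ..., H. fit_regression(pairs, targets): hypothetical least-squares
    oracle over F_h; it receives a list of (x, a) pairs with real-valued targets and returns
    a callable f_hat(x, a). Proceeds backwards over layers h = H, ..., 1."""
    f = {H + 1: (lambda x, a: 0.0)}                  # convention: f_{H+1} = 0
    for h in range(H, 0, -1):
        pairs, targets = [], []
        for (x, a, r, x_next) in datasets[h]:
            # empirical Bellman backup: r_h + max_{a'} f_{h+1}(x_{h+1}, a')
            backup = r + max(f[h + 1](x_next, ap) for ap in range(num_actions))
            pairs.append((x, a))
            targets.append(backup)
        f[h] = fit_regression(pairs, targets)        # approximate T_h f_{h+1} within F_h
    def greedy_policy(h, x):
        # learned policy: pi_hat_h(x) = argmax_a f_h(x, a)
        return max(range(num_actions), key=lambda a: f[h](x, a))
    return f, greedy_policy
```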
2.3 The Coverability Coefficient
Having seen that access to a data distribution µ with low concentrability C_conc(µ) is sufficient for sample-efficient offline RL, we now ask what existence of such a distribution implies about our ability to perform online RL. To this end, we introduce a new structural parameter, the coverability coefficient, whose value reflects the best concentrability coefficient that can be achieved with oracle knowledge of the underlying MDP M⋆.
² Specifically, FQI requires concentrability with Π chosen to be the set of all admissible policies (see, e.g., Chen and Jiang, 2019). Other algorithms (Xie and Jiang, 2020) can leverage concentrability w.r.t. smaller policy classes.
Definition 2 (Coverability). The coverability coefficient C_cov > 0 for a policy class Π is given by

    C_cov := inf_{µ_1,...,µ_H ∈ ∆(X×A)} sup_{π∈Π, h∈[H]} ‖d^π_h / µ_h‖_∞.
Coverability is an intrinsic structural property of the MDP M⋆ which implicitly restricts the complexity of the set of possible state distributions. While it is always the case that C_cov ≤ |X| · |A|, the coefficient can be significantly smaller (in particular, independent of |X|) for benign MDPs such as block MDPs and MDPs with low-rank structure (Chen and Jiang, 2019, Prop. 5). For example, in block MDPs, the state space X is potentially very large (e.g., raw pixels for an Atari game), but can be mapped down to a small number of unobserved latent states (e.g., the game's underlying state machine) which determine the dynamics. In this case, the coverability coefficient scales only with the number of latent states, not with the size of X; see Section 3.3 for further discussion and examples.
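For a finite policy class Π in the tabular setting, the infimum in Definition 2 can be evaluated in closed form: a short exchange argument shows that choosing µ_h(x, a) proportional to max_{π∈Π} d^π_h(x, a) is optimal for each layer, giving C_cov = max_{h∈[H]} ∑_{x,a} max_{π∈Π} d^π_h(x, a). The sketch below computes this quantity from tabulated occupancy measures; the dense-array layout is our own assumption, and the reduction to a finite Π is only for illustration.

```python
import numpy as np

def coverability(occupancies):
    """occupancies: dict mapping each pi in a finite class Pi to an array of shape (H, S, A)
    whose entry [h, x, a] is d^pi_h(x, a). Returns C_cov for this class together with the
    minimizing distributions mu_h(x, a), proportional to max_pi d^pi_h(x, a)."""
    d = np.stack(list(occupancies.values()))       # shape (|Pi|, H, S, A)
    pointwise_max = d.max(axis=0)                  # max_pi d^pi_h(x, a), shape (H, S, A)
    per_layer = pointwise_max.sum(axis=(1, 2))     # sum_{x,a} max_pi d^pi_h(x, a) for each h
    mu = pointwise_max / per_layer[:, None, None]  # optimal mu_h for each layer
    return float(per_layer.max()), mu
```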
With this definition in mind, we ask: If the MDP M⋆ satisfies low coverability, is sample-efficient online reinforcement learning possible? Note that if the learner were given access to data from the distribution µ that achieves the value of C_cov, it would be possible to simply appeal to offline RL methods such as FQI, but since the learner has no prior knowledge of µ, this question is non-trivial, and requires deliberate exploration.
3 Coverability Implies Sample-Efficient Online Exploration
We now present our main result, which shows that low coverability is sufficient for sample-efficient online
exploration. We first give the algorithm and regret bound (Section 3.1), then prove the result and give intuition
(Section 3.2). We conclude (Section 3.3) by highlighting additional structural properties of coverability and, as
an application, use these properties along with the main result to give regret bounds for learning in exogenous
block MDPs (Efroni et al.,2021).
3.1 Main Result
We work with a value function class F = F_1 × · · · × F_H, where F_h ⊆ (X × A → [0, 1]), with the goal of modeling value functions for the underlying MDP. We adopt the convention that f_{H+1} = 0, and for each f ∈ F, we let π_f denote the greedy policy with π_{f,h}(x) := argmax_{a∈A} f_h(x, a), and we use f_h(x, π_h) := E_{a∼π_h(·|x)}[f_h(x, a)] for any π_h. We take our policy class to be the induced class Π := {π_f | f ∈ F} for the remainder of the paper unless otherwise stated. We make the following standard completeness assumption, which requires that the value function class is closed under Bellman backups (Wang et al., 2020c; Jin et al., 2020b; Wang et al., 2021b; Jin et al., 2021a).

Assumption 1 (Completeness). For all h ∈ [H], we have T_h f_{h+1} ∈ F_h for all f_{h+1} ∈ F_{h+1}.

Completeness implies that F is realizable (that is, Q⋆ ∈ F), but is a stronger assumption in general. We assume for simplicity that |F| < ∞, and our results scale with log|F|; this can be extended to infinite classes via covering numbers using a standard analysis.
Algorithm.
Our result is based on a new analysis of the Golf algorithm of Jin et al. (2021a), which is presented in Algorithm 1. Golf is based on the principle of optimism in the face of uncertainty. At each round, the algorithm restricts to a confidence set F^(t) ⊆ F with the property that Q⋆ ∈ F^(t), and chooses π^(t) = π_{f^(t)} based on the value function f^(t) ∈ F^(t) with the most optimistic estimate f_1(x_1, π_{f,1}(x_1)) for the total reward. The confidence sets F^(t) are based on an empirical proxy to squared Bellman error, and are constructed in a global fashion that entails optimizing over f_h for all layers h ∈ [H] simultaneously (Zanette et al., 2020a).

Note that while Golf was originally introduced to provide regret bounds based on the notion of Bellman-Eluder dimension, we show (Section 5) that coverability cannot be (optimally) captured by this complexity measure, necessitating a new analysis.
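Algorithm 1 itself is not reproduced in this excerpt, so the sketch below is only meant to convey the structure described above: maintain a version space defined by an empirical squared-Bellman-error proxy and act greedily with respect to the most optimistic surviving candidate. The thresholded proxy (parameter `beta`), the brute-force enumeration over a finite class F, and the environment interface are simplifications we introduce for illustration; the construction in Jin et al. (2021a) is more refined, comparing each candidate's empirical loss to the best in-class fit rather than thresholding the raw proxy.

```python
def golf_style_learner(F, env, actions, T, H, beta, x1):
    """Optimistic confidence-set exploration in the spirit of Golf (illustrative sketch only).
    F: finite list of candidates f, each a dict {h: f_h} for h = 1, ..., H with f_h(x, a) in [0, 1].
    env.run_episode(pi): hypothetical simulator call that executes pi (a dict {h: pi_h}) in the
    underlying MDP and returns [(x_h, a_h, r_h, x_{h+1})] for h = 1, ..., H."""

    def backup(f, h, r, x_next):
        # empirical Bellman backup r_h + max_{a'} f_{h+1}(x_{h+1}, a'), with convention f_{H+1} = 0
        return r + (max(f[h + 1](x_next, a) for a in actions) if h + 1 <= H else 0.0)

    def bellman_proxy(f, h, transitions):
        # empirical proxy for the squared Bellman error of f at layer h
        return sum((f[h](x, a) - backup(f, h, r, x_next)) ** 2 for (x, a, r, x_next) in transitions)

    data = {h: [] for h in range(1, H + 1)}
    policies = []
    for t in range(T):
        # Confidence set F^(t): candidates whose Bellman-error proxy is small at every layer h.
        conf = [f for f in F if all(bellman_proxy(f, h, data[h]) <= beta for h in range(1, H + 1))]
        # Optimism: choose f^(t) maximizing the estimated optimal value f_1(x_1, pi_{f,1}(x_1)).
        f_t = max(conf, key=lambda f: max(f[1](x1, a) for a in actions))
        pi_t = {h: (lambda x, fh=f_t[h]: max(actions, key=lambda a: fh(x, a))) for h in range(1, H + 1)}
        for h, (x, a, r, x_next) in enumerate(env.run_episode(pi_t), start=1):
            data[h].append((x, a, r, x_next))
        policies.append(pi_t)
    return policies
```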