The Role of Coverage in Online Reinforcement Learning
Tengyang Xie∗
tx10@illinois.edu
Dylan J. Foster∗
dylanfoster@microsoft.com
Yu Bai
yu.bai@salesforce.com
Nan Jiang
nanjiang@illinois.edu
Sham M. Kakade
sham@seas.harvard.edu

∗Equal contribution.
Abstract
Coverage conditions—which assert that the data logging distribution adequately covers the state
space—play a fundamental role in determining the sample complexity of offline reinforcement learning.
While such conditions might seem irrelevant to online reinforcement learning at first glance, we establish
a new connection by showing—somewhat surprisingly—that the mere existence of a data distribution
with good coverage can enable sample-efficient online RL. Concretely, we show that coverability—that is,
existence of a data distribution that satisfies a ubiquitous coverage condition called concentrability—can
be viewed as a structural property of the underlying MDP, and can be exploited by standard algorithms
for sample-efficient exploration, even when the agent does not know said distribution. We complement
this result by proving that several weaker notions of coverage, despite being sufficient for offline RL,
are insufficient for online RL. We also show that existing complexity measures for online RL, including
Bellman rank and Bellman-Eluder dimension, fail to optimally capture coverability, and propose a new
complexity measure, the sequential extrapolation coefficient, to provide a unification.
1 Introduction
The last decade has seen the development of reinforcement learning algorithms with strong empirical performance
in domains including robotics (Kober et al.,2013;Lillicrap et al.,2015), dialogue systems (Li et al.,2016), and
personalization (Agarwal et al.,2016;Tewari and Murphy,2017). While there is great interest in applying
these techniques to real-world decision making applications, the number of samples (steps of interaction)
required to do so is often prohibitive, with state-of-the-art algorithms requiring millions of samples to reach
human-level performance in challenging domains. Developing algorithms with improved sample efficiency,
which entails efficiently generalizing across high-dimensional states and actions while taking advantage of
problem structure as modeled by practitioners, remains a major challenge.
Investigation into design and analysis of algorithms for sample-efficient reinforcement learning has largely
focused on two distinct problem formulations:
• Online reinforcement learning, where the learner can repeatedly interact with the environment by
executing a policy and observing the resulting trajectory.
• Offline reinforcement learning, where the learner has access to logged transitions and rewards gathered
from a fixed behavioral policy (e.g., historical data or expert demonstrations), but cannot directly
interact with the underlying environment.
While these formulations share a common goal (learning a near-optimal policy), the algorithms used to
achieve this goal and the conditions under which it can be achieved are seemingly quite different. Focusing on
value function approximation, sample-efficient algorithms for online reinforcement learning require both (a)
representation conditions, which assert that the function approximator is flexible enough to represent value
functions for the underlying MDP (optimal or otherwise), and (b) exploration conditions (or, structural
conditions) which limit the amount of exploration required to learn a near-optimal policy—typically by
enabling extrapolation across states or limiting the number of effective state distributions (Russo and Van Roy,
2013;Jiang et al.,2017;Sun et al.,2019;Wang et al.,2020c;Du et al.,2021;Jin et al.,2021a;Foster et al.,
2021). Algorithms for offline reinforcement learning typically require similar representation conditions.
However, since data is collected passively from a fixed logging policy/distribution rather than actively, the
exploration conditions used in online RL are replaced with coverage conditions, which assert that the data
collection distribution provides sufficient coverage over the state space (Antos et al.,2008;Chen and Jiang,
2019;Xie and Jiang,2020,2021;Jin et al.,2021b;Rashidinejad et al.,2021;Foster et al.,2022;Zhan et al.,
2022). The aim for both lines of research (online and offline) is to identify the weakest possible conditions
under which learning is possible, and design algorithms that take advantage of these conditions. The two
lines have largely evolved in parallel, and it is natural to wonder whether there are deeper connections. Since
the conditions for sample-efficient online RL and offline RL mainly differ via exploration versus coverage, this
leads us to ask:
If an MDP admits a data distribution with favorable coverage for offline RL, what does this imply
about our ability to perform online RL efficiently?
Beyond intrinsic theoretical value, this question is motivated by the observation that many real-world
applications lie on a spectrum between online and offline. It is common for the learner to have access to
logged/offline data, yet also have the ability to actively interact with the underlying environment, possibly
subject to limitations such as an exploration budget (Kalashnikov et al.,2018). Building a theory of real-world
RL that can lead to algorithm design insights for such settings requires understanding the interplay between
online and offline RL.
1.1 Our Results
We investigate connections between coverage conditions in offline reinforcement learning and exploration
in online reinforcement learning by focusing on the concentrability coefficient, the most ubiquitous notion
of coverage in offline RL. Concentrability quantifies the extent to which the data collection distribution
uniformly covers the state-action distribution induced by any policy. We introduce a new structural property,
coverability, which reflects the best concentrability coefficient that can be achieved by any data distribution,
possibly designed by an oracle with knowledge of the underlying MDP.
1. We show (Section 3) that coverability (that is, mere existence of a distribution with good concentrability)
is sufficient for sample-efficient online exploration, even when the learner has no prior knowledge of this
distribution. This result requires no additional assumptions on the underlying MDP beyond standard
Bellman completeness, and—perhaps surprisingly—is achieved using standard algorithms (Jin et al.,
2021a), albeit with analysis ideas that go beyond existing techniques.
2. We show (Section 4) that several weaker notions of coverage in offline RL, including single-policy concentrability (Jin et al., 2021b; Rashidinejad et al., 2021) and conditions based on Bellman residuals (Chen and Jiang, 2019; Xie et al., 2021a; Cheng et al., 2022), are insufficient for sample-efficient online exploration. This shows that in general, coverage in offline reinforcement learning and exploration in online RL are not compatible, and highlights the need for additional investigation going forward.
Our results serve as a starting point for systematic study of connections between online and offline learnability
in RL. To this end, we provide several secondary results:
1. We show (Section 5) that existing complexity measures for online RL, including Bellman rank and
Bellman-Eluder dimension, do not optimally capture coverability, and provide a new complexity measure,
the sequential extrapolation coefficient, which unifies these notions.
2. We establish (Section 3.3) connections between coverability and reinforcement learning with exogenous
noise, with applications to learning in exogenous block MDPs (Efroni et al.,2021,2022a).
3. We give algorithms for reward-free exploration (Jin et al., 2020a; Chen et al., 2022) under coverability
(Appendix D).
While our results primarily concern analysis of existing algorithms rather than algorithm design, they highlight
a number of exciting directions for future research, and we are optimistic that the notion of coverability can
guide the design of practical algorithms going forward.
Notation.
For an integer n ∈ ℕ, we let [n] denote the set {1, . . . , n}. For a set X, we let ∆(X) denote the set of all probability distributions over X. We adopt non-asymptotic big-oh notation: For functions f, g : X → ℝ₊, we write f = O(g) (resp. f = Ω(g)) if there exists a constant C > 0 such that f(x) ≤ C·g(x) (resp. f(x) ≥ C·g(x)) for all x ∈ X. We write f = Õ(g) if f = O(g · polylog(T)), f = Ω̃(g) if f = Ω(g/polylog(T)), and f = Θ̃(g) if f = Õ(g) and f = Ω̃(g). We write f ≍ g if f = Θ̃(g).
2 Background: Reinforcement Learning, Coverage, and Coverability
We begin by formally introducing the online and offline reinforcement learning problems, then review the
concept of coverage in offline reinforcement learning, focusing on concentrability. Based on this notion, we
introduce coverability as a structural property.
Markov decision processes.
We consider an episodic reinforcement learning setting. Formally, a Markov decision process M = (X, A, P, R, H, x₁) consists of a (potentially large) state space X, action space A, horizon H, probability transition function P = {P_h}_{h=1}^H, where P_h : X × A → ∆(X), reward function R = {R_h}_{h=1}^H, where R_h : X × A → [0, 1], and deterministic initial state x₁ ∈ X.¹ A (randomized) policy is a sequence of per-timestep functions π = {π_h : X → ∆(A)}_{h=1}^H. The policy induces a distribution over trajectories (x₁, a₁, r₁), . . . , (x_H, a_H, r_H) via the following process. For h = 1, . . . , H: a_h ∼ π_h(· | x_h), r_h = R_h(x_h, a_h), and x_{h+1} ∼ P_h(· | x_h, a_h). For notational convenience, we use x_{H+1} to denote a deterministic terminal state with zero reward. We let E^π[·] and P^π[·] denote expectation and probability under this process, respectively. The expected reward for policy π is given by J(π) := E^π[∑_{h=1}^H r_h], and the value function and Q-function for π are given by

    V^π_h(x) := E^π[∑_{h'=h}^H r_{h'} | x_h = x],   and   Q^π_h(x, a) := E^π[∑_{h'=h}^H r_{h'} | x_h = x, a_h = a].

We let π⋆ = {π⋆_h}_{h=1}^H denote the optimal (deterministic) policy, which satisfies the Bellman equation and maximizes Q^π_h(x, a) for all (x, a) ∈ X × A simultaneously; we define V⋆_h = V^{π⋆}_h and Q⋆_h = Q^{π⋆}_h. We define the occupancy measure for policy π via d^π_h(x, a) := P^π[x_h = x, a_h = a] and d^π_h(x) := P^π[x_h = x]. We let T_h denote the Bellman operator for layer h, defined via [T_h f](x, a) = R_h(x, a) + E_{x' ∼ P_h(·|x,a)}[max_{a'} f(x', a')] for f : X × A → ℝ.
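To make these definitions concrete, the following sketch simulates episodes of a small tabular MDP under a fixed randomized policy and estimates the occupancy measures d^π_h(x, a) by Monte Carlo. It is purely illustrative: the dense-array encodings of P, R, and π, and the helper names, are our own assumptions rather than part of the formal setup above, and layers are 0-indexed in the code.

```python
import numpy as np

def rollout(P, R, pi, H, rng, x1=0):
    """Simulate one episode. P[h]: (S, A, S) transition probabilities, R[h]: (S, A) rewards,
    pi[h]: (S, A) action probabilities. Returns the trajectory [(x_h, a_h, r_h)] for h = 1..H."""
    traj, x = [], x1
    for h in range(H):
        a = rng.choice(pi[h].shape[1], p=pi[h][x])        # a_h ~ pi_h(. | x_h)
        r = R[h][x, a]                                    # r_h = R_h(x_h, a_h)
        x_next = rng.choice(P[h].shape[2], p=P[h][x, a])  # x_{h+1} ~ P_h(. | x_h, a_h)
        traj.append((x, a, r))
        x = x_next
    return traj

def estimate_occupancy(P, R, pi, H, n_episodes=10_000, seed=0):
    """Monte Carlo estimate of d^pi_h(x, a) = P^pi[x_h = x, a_h = a]; entry [h-1, x, a]
    of the returned array estimates the occupancy at layer h."""
    rng = np.random.default_rng(seed)
    S, A = R[0].shape
    d = np.zeros((H, S, A))
    for _ in range(n_episodes):
        for h, (x, a, _) in enumerate(rollout(P, R, pi, H, rng)):
            d[h, x, a] += 1.0
    return d / n_episodes
```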
Assumptions.
We assume that rewards are normalized such that ∑_{h=1}^H r_h ∈ [0, 1] (Jiang and Agarwal, 2018; Wang et al., 2020a; Zhang et al., 2021; Jin et al., 2021a). To simplify technical presentation, we assume that X and A are countable; we anticipate that this assumption can be removed.
2.1 Online Reinforcement Learning
Our main results concern online reinforcement learning in an episodic framework, where the learner repeatedly
interacts with an unknown MDP by executing a policy and observing the resulting trajectory, with the goal
of maximizing total reward.
Formally, the protocol proceeds in T rounds, where at each round t = 1, . . . , T, the learner:

• Selects a policy π^(t) = {π^(t)_h}_{h=1}^H to execute in the (unknown) underlying MDP M⋆.

• Observes the resulting trajectory (x^(t)_1, a^(t)_1, r^(t)_1), . . . , (x^(t)_H, a^(t)_H, r^(t)_H).

The learner’s goal is to minimize their cumulative regret, defined via

    Reg := ∑_{t=1}^T [J(π⋆) − J(π^(t))].
¹ While our results assume that the initial state is fixed for simplicity, this assumption is straightforward to relax.
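As a minimal illustration of this protocol, the loop below runs a generic learner against a simulator and tracks cumulative regret. The interfaces `learner`, `env`, and the optimal value `J_star` are hypothetical and introduced only for this sketch; note that each episode's return is a noisy estimate of J(π^(t)), so the quantity tracked matches the regret only in expectation.

```python
def run_online_rl(learner, env, T, J_star):
    """Run the episodic online RL protocol for T rounds and return an estimate of the regret.
    Assumed interfaces: learner.select_policy() returns pi^(t); env.run_episode(pi) executes pi
    in the underlying MDP and returns [(x_h, a_h, r_h)] for h = 1, ..., H; learner.update(traj)
    incorporates the observed trajectory. J_star = J(pi_star) is used only to measure regret."""
    regret_estimate = 0.0
    for t in range(T):
        pi_t = learner.select_policy()              # select pi^(t)
        traj = env.run_episode(pi_t)                # observe the resulting trajectory
        learner.update(traj)
        episode_return = sum(r for (_, _, r) in traj)
        regret_estimate += J_star - episode_return  # equals J(pi_star) - J(pi^(t)) in expectation
    return regret_estimate
```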
To achieve sample-efficient online reinforcement learning guarantees that do not depend on the size of the state space, one typically appeals to value function approximation methods that take advantage of a function class F ⊂ (X × A → ℝ) that attempts to model the value functions for the underlying MDP M⋆ (optimal or otherwise). An active line of research provides structural conditions under which such approaches succeed (Russo and Van Roy, 2013; Jiang et al., 2017; Sun et al., 2019; Wang et al., 2020c; Du et al., 2021; Jin et al., 2021a; Foster et al., 2021), based on assumptions that control the interplay between the function approximator F and the dynamics of the MDP M⋆. These results require (i) representation conditions, which require that F is flexible enough to model value functions of interest (e.g., Q⋆ ∈ F or T_h F_{h+1} ⊆ F_h), and (ii) exploration conditions, which either explicitly or implicitly limit the amount of exploration required for a deliberate algorithm to learn a near-optimal policy. This is typically accomplished by either enabling extrapolation from states already visited, or by limiting the number of effective state distributions that can be encountered.
2.2 Offline Reinforcement Learning and Coverage Conditions
Our aim is to investigate parallels between online and offline reinforcement learning. In offline reinforcement learning, the learner cannot actively execute policies in the underlying MDP M⋆. Instead, for each layer h, they receive a dataset D_h of n tuples (x_h, a_h, r_h, x_{h+1}) with r_h = R_h(x_h, a_h), x_{h+1} ∼ P_h(· | x_h, a_h), and (x_h, a_h) ∼ µ_h i.i.d., where µ_h ∈ ∆(X × A) is the data collection distribution; we define µ = {µ_h}_{h=1}^H. The goal of the learner is to use this data to learn an ε-optimal policy π̂, that is:

    J(π⋆) − J(π̂) ≤ ε.
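For concreteness, the following sketch generates an offline dataset of this form in a tabular MDP; the dense-array encodings of µ_h, P_h, and R_h are our own assumptions for illustration.

```python
import numpy as np

def collect_offline_data(mu, P, R, n, seed=0):
    """mu[h]: (S, A) data distribution over state-action pairs at layer h; P[h]: (S, A, S)
    transitions; R[h]: (S, A) rewards. Returns, for each layer, n i.i.d. tuples
    (x_h, a_h, r_h, x_{h+1}) with (x_h, a_h) ~ mu_h and x_{h+1} ~ P_h(. | x_h, a_h)."""
    rng = np.random.default_rng(seed)
    H, (S, A) = len(mu), mu[0].shape
    datasets = []
    for h in range(H):
        flat = rng.choice(S * A, size=n, p=mu[h].reshape(-1))  # sample (x_h, a_h) ~ mu_h
        xs, acts = np.unravel_index(flat, (S, A))
        rows = [(int(x), int(a), float(R[h][x, a]),
                 int(rng.choice(S, p=P[h][x, a])))             # x_{h+1} ~ P_h(. | x_h, a_h)
                for x, a in zip(xs, acts)]
        datasets.append(rows)
    return datasets
```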
Algorithms for offline reinforcement learning require representation conditions similar to those required for
online RL. However, since it is not possible to actively explore the underlying MDP, one dispenses with
exploration conditions and instead considers coverage conditions, which require that each data distribution
µ_h sufficiently covers the state space.
As an example, consider Fitted Q-Iteration (FQI), one of the most well-studied offline reinforcement learning algorithms (Munos, 2007; Munos and Szepesvári, 2008; Chen and Jiang, 2019). The algorithm, which uses least-squares to approximate Bellman backups, is known to succeed under (i) a representation condition known as Bellman completeness (or “completeness”), which requires that T_h f ∈ F_h for all f ∈ F_{h+1}, and (ii) a coverage condition called concentrability.
Definition 1 (Concentrability). The concentrability coefficient for a data distribution µ = {µ_h}_{h=1}^H and policy class Π is given by

    C_conc(µ) := sup_{π∈Π, h∈[H]} ‖d^π_h / µ_h‖_∞.
Concentrability requires that the data distribution uniformly covers all possible induced state distributions. With concentrability² and completeness, FQI can learn an ε-optimal policy using poly(C_conc(µ), log|F|, H, ε⁻¹) samples. Importantly, this result scales only with the concentrability coefficient C_conc(µ) and the capacity log|F| for the function class, and has no explicit dependence on the size of the state space. There is a vast literature which provides algorithms with similar, often more refined guarantees (Chen and Jiang, 2019; Xie and Jiang, 2020, 2021; Jin et al., 2021b; Rashidinejad et al., 2021; Foster et al., 2022; Zhan et al., 2022).
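For reference, a schematic version of the FQI update is sketched below. It is not the exact algorithm analyzed in the works cited above: the regression oracle `fit_regression`, the dataset layout, and the finite action set are our own simplifying assumptions, but the backward least-squares approximation of the Bellman backup T_h f_{h+1} is the core idea.

```python
def fitted_q_iteration(datasets, fit_regression, H, num_actions):
    """Schematic FQI. datasets[h]: list of offline tuples (x_h, a_h, r_h, x_next) drawn from
    mu_h, indexed h = 1, ..., H. fit_regression(pairs, targets): hypothetical least-squares
    oracle over F_h; it receives a list of (x, a) pairs with real-valued targets and returns
    a callable f_hat(x, a). Proceeds backwards over layers h = H, ..., 1."""
    f = {H + 1: (lambda x, a: 0.0)}                  # convention: f_{H+1} = 0
    for h in range(H, 0, -1):
        pairs, targets = [], []
        for (x, a, r, x_next) in datasets[h]:
            # empirical Bellman backup: r_h + max_{a'} f_{h+1}(x_{h+1}, a')
            backup = r + max(f[h + 1](x_next, ap) for ap in range(num_actions))
            pairs.append((x, a))
            targets.append(backup)
        f[h] = fit_regression(pairs, targets)        # approximate T_h f_{h+1} within F_h
    def greedy_policy(h, x):
        # learned policy: pi_hat_h(x) = argmax_a f_h(x, a)
        return max(range(num_actions), key=lambda a: f[h](x, a))
    return f, greedy_policy
```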
2.3 The Coverability Coefficient
Having seen that access to a data distribution µ with low concentrability C_conc(µ) is sufficient for sample-efficient offline RL, we now ask what existence of such a distribution implies about our ability to perform online RL. To this end, we introduce a new structural parameter, the coverability coefficient, whose value reflects the best concentrability coefficient that can be achieved with oracle knowledge of the underlying MDP M⋆.
² Specifically, FQI requires concentrability with Π chosen to be the set of all admissible policies (see, e.g., Chen and Jiang, 2019). Other algorithms (Xie and Jiang, 2020) can leverage concentrability w.r.t. smaller policy classes.
Definition 2 (Coverability). The coverability coefficient C_cov > 0 for a policy class Π is given by

    C_cov := inf_{µ_1,...,µ_H ∈ ∆(X×A)} sup_{π∈Π, h∈[H]} ‖d^π_h / µ_h‖_∞.
Coverability is an intrinsic structural property of the MDP M⋆ which implicitly restricts the complexity of the set of possible state distributions. While it is always the case that C_cov ≤ |X| · |A|, the coefficient can be significantly smaller (in particular, independent of |X|) for benign MDPs such as block MDPs and MDPs with low-rank structure (Chen and Jiang, 2019, Prop. 5). For example, in block MDPs, the state space X is potentially very large (e.g., raw pixels for an Atari game), but can be mapped down to a small number of unobserved latent states (e.g., the game's underlying state machine) which determine the dynamics. In this case, the coverability coefficient scales only with the number of latent states, not with the size of X; see Section 3.3 for further discussion and examples.
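For a finite policy class Π in the tabular setting, the infimum in Definition 2 can be evaluated in closed form: a short exchange argument shows that choosing µ_h(x, a) proportional to max_{π∈Π} d^π_h(x, a) is optimal for each layer, giving C_cov = max_{h∈[H]} ∑_{x,a} max_{π∈Π} d^π_h(x, a). The sketch below computes this quantity from tabulated occupancy measures; the dense-array layout is our own assumption, and the reduction to a finite Π is only for illustration.

```python
import numpy as np

def coverability(occupancies):
    """occupancies: dict mapping each pi in a finite class Pi to an array of shape (H, S, A)
    whose entry [h, x, a] is d^pi_h(x, a). Returns C_cov for this class together with the
    minimizing distributions mu_h(x, a), proportional to max_pi d^pi_h(x, a)."""
    d = np.stack(list(occupancies.values()))       # shape (|Pi|, H, S, A)
    pointwise_max = d.max(axis=0)                  # max_pi d^pi_h(x, a), shape (H, S, A)
    per_layer = pointwise_max.sum(axis=(1, 2))     # sum_{x,a} max_pi d^pi_h(x, a) for each h
    mu = pointwise_max / per_layer[:, None, None]  # optimal mu_h for each layer
    return float(per_layer.max()), mu
```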
With this definition in mind, we ask: If the MDP M⋆ satisfies low coverability, is sample-efficient online reinforcement learning possible? Note that if the learner were given access to data from the distribution µ that achieves the value of C_cov, it would be possible to simply appeal to offline RL methods such as FQI, but since the learner has no prior knowledge of µ, this question is non-trivial, and requires deliberate exploration.
3 Coverability Implies Sample-Efficient Online Exploration
We now present our main result, which shows that low coverability is sufficient for sample-efficient online
exploration. We first give the algorithm and regret bound (Section 3.1), then prove the result and give intuition
(Section 3.2). We conclude (Section 3.3) by highlighting additional structural properties of coverability and, as
an application, use these properties along with the main result to give regret bounds for learning in exogenous
block MDPs (Efroni et al.,2021).
3.1 Main Result
We work with a value function class F = F_1 × · · · × F_H, where F_h ⊆ (X × A → [0, 1]), with the goal of modeling value functions for the underlying MDP. We adopt the convention that f_{H+1} = 0, and for each f ∈ F, we let π_f denote the greedy policy with π_{f,h}(x) := argmax_{a∈A} f_h(x, a), and we use f_h(x, π_h) := E_{a∼π_h(·|x)}[f_h(x, a)] for any π_h. We take our policy class to be the induced class Π := {π_f | f ∈ F} for the remainder of the paper unless otherwise stated. We make the following standard completeness assumption, which requires that the value function class is closed under Bellman backups (Wang et al., 2020c; Jin et al., 2020b; Wang et al., 2021b; Jin et al., 2021a).

Assumption 1 (Completeness). For all h ∈ [H], we have T_h f_{h+1} ∈ F_h for all f_{h+1} ∈ F_{h+1}.

Completeness implies that F is realizable (that is, Q⋆ ∈ F), but is a stronger assumption in general. We assume for simplicity that |F| < ∞, and our results scale with log|F|; this can be extended to infinite classes via covering numbers using a standard analysis.
Algorithm.
Our result is based on a new analysis of the Golf algorithm of Jin et al. (2021a), which is presented in Algorithm 1. Golf is based on the principle of optimism in the face of uncertainty. At each round, the algorithm restricts to a confidence set F^(t) ⊆ F with the property that Q⋆ ∈ F^(t), and chooses π^(t) = π_{f^(t)} based on the value function f^(t) ∈ F^(t) with the most optimistic estimate f_1(x_1, π_{f,1}(x_1)) for the total reward. The confidence sets F^(t) are based on an empirical proxy to squared Bellman error, and are constructed in a global fashion that entails optimizing over f_h for all layers h ∈ [H] simultaneously (Zanette et al., 2020a).

Note that while Golf was originally introduced to provide regret bounds based on the notion of Bellman-Eluder dimension, we show (Section 5) that coverability cannot be (optimally) captured by this complexity measure, necessitating a new analysis.
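Algorithm 1 itself is not reproduced in this excerpt, so the sketch below is only meant to convey the structure described above: maintain a version space defined by an empirical squared-Bellman-error proxy and act greedily with respect to the most optimistic surviving candidate. The thresholded proxy (parameter `beta`), the brute-force enumeration over a finite class F, and the environment interface are simplifications we introduce for illustration; the construction in Jin et al. (2021a) is more refined, comparing each candidate's empirical loss to the best in-class fit rather than thresholding the raw proxy.

```python
def golf_style_learner(F, env, actions, T, H, beta, x1):
    """Optimistic confidence-set exploration in the spirit of Golf (illustrative sketch only).
    F: finite list of candidates f, each a dict {h: f_h} for h = 1, ..., H with f_h(x, a) in [0, 1].
    env.run_episode(pi): hypothetical simulator call that executes pi (a dict {h: pi_h}) in the
    underlying MDP and returns [(x_h, a_h, r_h, x_{h+1})] for h = 1, ..., H."""

    def backup(f, h, r, x_next):
        # empirical Bellman backup r_h + max_{a'} f_{h+1}(x_{h+1}, a'), with convention f_{H+1} = 0
        return r + (max(f[h + 1](x_next, a) for a in actions) if h + 1 <= H else 0.0)

    def bellman_proxy(f, h, transitions):
        # empirical proxy for the squared Bellman error of f at layer h
        return sum((f[h](x, a) - backup(f, h, r, x_next)) ** 2 for (x, a, r, x_next) in transitions)

    data = {h: [] for h in range(1, H + 1)}
    policies = []
    for t in range(T):
        # Confidence set F^(t): candidates whose Bellman-error proxy is small at every layer h.
        conf = [f for f in F if all(bellman_proxy(f, h, data[h]) <= beta for h in range(1, H + 1))]
        # Optimism: choose f^(t) maximizing the estimated optimal value f_1(x_1, pi_{f,1}(x_1)).
        f_t = max(conf, key=lambda f: max(f[1](x1, a) for a in actions))
        pi_t = {h: (lambda x, fh=f_t[h]: max(actions, key=lambda a: fh(x, a))) for h in range(1, H + 1)}
        for h, (x, a, r, x_next) in enumerate(env.run_episode(pi_t), start=1):
            data[h].append((x, a, r, x_next))
        policies.append(pi_t)
    return policies
```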