enabling extrapolation across states or limiting the number of effective state distributions (Russo and Van Roy, 2013; Jiang et al., 2017; Sun et al., 2019; Wang et al., 2020c; Du et al., 2021; Jin et al., 2021a; Foster et al.,
2021). Algorithms for offline reinforcement learning typically require similar representation conditions.
However, since data is collected passively from a fixed logging policy/distribution rather than actively, the
exploration conditions used in online RL are replaced with coverage conditions, which assert that the data
collection distribution provides sufficient coverage over the state space (Antos et al., 2008; Chen and Jiang, 2019; Xie and Jiang, 2020, 2021; Jin et al., 2021b; Rashidinejad et al., 2021; Foster et al., 2022; Zhan et al.,
2022). The aim for both lines of research (online and offline) is to identify the weakest possible conditions
under which learning is possible, and design algorithms that take advantage of these conditions. The two
lines have largely evolved in parallel, and it is natural to wonder whether there are deeper connections. Since the conditions for sample-efficient online RL and offline RL mainly differ in whether they demand exploration or coverage, we are led to ask:
If an MDP admits a data distribution with favorable coverage for offline RL, what does this imply
about our ability to perform online RL efficiently?
Beyond intrinsic theoretical value, this question is motivated by the observation that many real-world
applications lie on a spectrum between online and offline. It is common for the learner to have access to
logged/offline data, yet also have the ability to actively interact with the underlying environment, possibly
subject to limitations such as an exploration budget (Kalashnikov et al.,2018). Building a theory of real-world
RL that can lead to algorithm design insights for such settings requires understanding the interplay between
online and offline RL.
1.1 Our Results
We investigate connections between coverage conditions in offline reinforcement learning and exploration
in online reinforcement learning by focusing on the concentrability coefficient, the most widely used notion of coverage in offline RL. Concentrability quantifies the extent to which the data collection distribution uniformly covers the state-action distribution induced by any policy. We introduce a new structural property, coverability, which reflects the best concentrability coefficient that can be achieved by any data distribution, possibly designed by an oracle with knowledge of the underlying MDP.
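To fix intuition, here is a rough sketch of these quantities (normalization and whether coverage is measured per-layer vary across the literature; this is a simplified paraphrase rather than a formal definition). Writing $d^{\pi}$ for the state-action occupancy measure of a policy $\pi$ and $\Delta(\mathcal{S}\times\mathcal{A})$ for the set of data distributions, concentrability measures the worst-case density ratio to any policy, and coverability optimizes this ratio over the choice of data distribution:
\[
C_{\mathrm{conc}}(\mu) \,=\, \sup_{\pi}\,\sup_{s,a}\,\frac{d^{\pi}(s,a)}{\mu(s,a)},
\qquad\qquad
C_{\mathrm{cov}} \,=\, \inf_{\mu\in\Delta(\mathcal{S}\times\mathcal{A})}\, C_{\mathrm{conc}}(\mu).
\]
In particular, low coverability asserts only that some well-covering distribution $\mu$ exists; an online learner is not given access to it.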
1. We show (Section 3) that coverability (that is, mere existence of a distribution with good concentrability) is sufficient for sample-efficient online exploration, even when the learner has no prior knowledge of this distribution. This result requires no additional assumptions on the underlying MDP beyond standard Bellman completeness, and—perhaps surprisingly—is achieved using standard algorithms (Jin et al., 2021a), albeit with analysis ideas that go beyond existing techniques.
2. We show (Section 4) that several weaker notions of coverage in offline RL, including single-policy concentrability (Jin et al., 2021b; Rashidinejad et al., 2021), which we recall below, and conditions based on Bellman residuals (Chen and Jiang, 2019; Xie et al., 2021a; Cheng et al., 2022), are insufficient for sample-efficient online exploration. This shows that, in general, coverage in offline reinforcement learning and exploration in online RL are not compatible, and highlights the need for additional investigation going forward.
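As a point of contrast, a sketch of single-policy concentrability (again, exact formulations vary across papers): rather than requiring coverage of every policy, the data distribution $\mu$ need only cover a single comparator policy $\pi^{\star}$ (for example, an optimal policy), i.e.,
\[
C^{\pi^{\star}}(\mu) \,=\, \sup_{s,a}\,\frac{d^{\pi^{\star}}(s,a)}{\mu(s,a)} \,<\, \infty.
\]
This condition is known to suffice for offline learning of a policy competing with $\pi^{\star}$, yet, by the results of Section 4, it does not by itself enable sample-efficient online exploration.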
Our results serve as a starting point for systematic study of connections between online and offline learnability
in RL. To this end, we provide several secondary results:
1. We show (Section 5) that existing complexity measures for online RL, including Bellman rank and Bellman-Eluder dimension, do not optimally capture coverability, and provide a new complexity measure, the sequential extrapolation coefficient, which unifies these notions.
2. We establish (Section 3.3) connections between coverability and reinforcement learning with exogenous noise, with applications to learning in exogenous block MDPs (Efroni et al., 2021, 2022a).
3. We give algorithms for reward-free exploration (Jin et al., 2020a; Chen et al., 2022) under coverability (Appendix D).
While our results primarily concern analysis of existing algorithms rather than algorithm design, they highlight
a number of exciting directions for future research, and we are optimistic that the notion of coverability can