exploration. In fact, guarantees for ε-greedy-like algorithms only exist under additional structural assumptions on
the underlying problem.
In our experiments, we test our approach on Montezuma’s Revenge, and we pick RND [Burda et al., 2018] as a
deep RL exploration baseline due to its simplicity and effectiveness on this game.
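As a concrete illustration of the RND-style bonus, here is a minimal sketch that assumes flat observation vectors and small MLPs; the published agent instead uses convolutional networks over pixels together with observation and reward normalization, so the class name, sizes, and interface below are illustrative only.

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """Sketch of a Random Network Distillation style exploration bonus."""

    def __init__(self, obs_dim, embed_dim=64):
        super().__init__()
        # Fixed, randomly initialized target network (never trained).
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                    nn.Linear(128, embed_dim))
        # Predictor network trained to match the target on visited states.
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, embed_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)

    def forward(self, obs):
        # The squared prediction error is large on rarely visited states; it
        # serves both as the intrinsic reward and as the predictor's loss.
        return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)
```

The prediction error is used as an intrinsic reward during online training, encouraging the agent to seek out states where the predictor is still inaccurate.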
Online RL with reset distributions
When an exploratory reset distribution is available, a number of statistically
and computationally efficient algorithms are known. The classic algorithms are CPI [Kakade and Langford, 2002],
PSDP [Bagnell et al., 2003], Natural Policy Gradient [Kakade, 2001, Agarwal et al., 2020b], and POLYTEX
[Abbasi-Yadkori et al., 2019]. Uchendu et al. [2022] recently demonstrated that algorithms like PSDP work well when
equipped with modern neural network function approximators. However, these algorithms (and their analyses)
rely heavily on the reset distribution to mitigate the exploration challenge, and such a distribution is typically
unavailable in practice unless one also has a simulator and access to its internal states. In contrast, we assume that the
offline data covers some high-quality policy (it need not be globally exploratory), which helps with exploration, but
we do not require an exploratory reset distribution. This makes the hybrid setting much more practically appealing.
Offline RL
Offline RL methods learn policies solely from a given offline dataset, with no interaction whatsoever.
When the dataset has global coverage, algorithms such as FQI [Munos and Szepesvári, 2008, Chen and Jiang,
2019] or certainty-equivalence model learning [Ross and Bagnell, 2012] can find near-optimal policies in an
oracle-efficient manner, via least squares or model-fitting oracles. However, with only partial coverage, existing
methods either (a) are not computationally efficient due to the difficulty of implementing pessimism both in linear
settings with large action spaces [Jin et al., 2021b, Zhang et al., 2022b, Chang et al., 2021] and general function
approximation settings [Uehara and Sun, 2021, Xie et al., 2021a, Jiang and Huang, 2020, Chen and Jiang, 2022,
Zhan et al., 2022], or (b) require strong representation conditions such as policy-based Bellman completeness [Xie
et al., 2021a, Zanette et al., 2021]. In contrast, in the hybrid setting, we obtain an efficient algorithm under the more
natural condition of completeness w.r.t. only the Bellman optimality operator.
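To make the oracle-efficiency point concrete, below is a minimal sketch of FQI with a generic regression oracle (here scikit-learn's GradientBoostingRegressor) on featurized states; the function name, the scalar encoding of actions, and the choice of regressor are our own illustrative assumptions rather than the cited implementations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fitted_q_iteration(dataset, num_actions, gamma=0.99, iterations=50):
    """Fitted Q-Iteration on a fixed dataset of (s, a, r, s_next, done) tuples.

    Each round regresses Q(s, a) onto the Bellman backup
    r + gamma * max_a' Q(s_next, a'), so the only computational primitive
    is a standard regression (least-squares-style) oracle.
    """
    states = np.array([t[0] for t in dataset], dtype=float)
    actions = np.array([t[1] for t in dataset], dtype=float)
    rewards = np.array([t[2] for t in dataset], dtype=float)
    next_states = np.array([t[3] for t in dataset], dtype=float)
    dones = np.array([t[4] for t in dataset], dtype=float)

    features = np.concatenate([states, actions[:, None]], axis=1)
    q = None
    for _ in range(iterations):
        if q is None:
            targets = rewards  # first round: regress on immediate rewards
        else:
            # Evaluate the current Q estimate at every action in the next state.
            next_q = np.column_stack([
                q.predict(np.concatenate(
                    [next_states, np.full((len(dataset), 1), a)], axis=1))
                for a in range(num_actions)
            ])
            targets = rewards + gamma * (1.0 - dones) * next_q.max(axis=1)
        q = GradientBoostingRegressor().fit(features, targets)
    return q
```

Pessimistic methods for the partial-coverage regime must additionally constrain or penalize this regression step, which is one source of the computational difficulty discussed above.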
Among the many empirical offline RL methods (e.g., Kumar et al. [2020], Yu et al. [2021], Kostrikov et al.
[2021], Fujimoto and Gu [2021]), we use CQL [Kumar et al., 2020] as a baseline in our experiments, since it has
been shown to work in image-based settings such as Atari games.
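For reference, a minimal sketch of the discrete-action CQL objective is given below; q_net and target_net are assumed to map a batch of states to per-action Q-values, and the names and the penalty weight alpha are illustrative rather than taken from the official implementation.

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_net, batch, gamma=0.99, alpha=1.0):
    """Conservative Q-Learning loss for discrete actions (sketch).

    On top of the usual TD error, a conservatism penalty (logsumexp of Q over
    all actions minus the Q-value of the dataset action) pushes down the values
    of actions not supported by the offline data.
    """
    s, a, r, s_next, done = batch  # a: int64 actions, done: float flags

    q_all = q_net(s)                                      # (batch, num_actions)
    q_taken = q_all.gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

    td_loss = F.mse_loss(q_taken, target)
    conservatism = (torch.logsumexp(q_all, dim=1) - q_taken).mean()
    return td_loss + alpha * conservatism
```

The conservatism term is the only difference from a standard DQN-style TD loss in this sketch.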
Online RL with offline datasets
Ross and Bagnell [2012] developed a model-based algorithm for a similar hybrid
setting. In comparison, our approach is model-free and consequently may be more suitable for high-dimensional
state spaces (e.g., raw-pixel images). Xie et al. [2021b] studied hybrid RL and showed that offline data does not yield
statistical improvements in tabular MDPs. Our work instead focuses on the function approximation setting and
demonstrates computational benefits of hybrid RL.
On the empirical side, several works consider combining offline expert demonstrations with online interaction
[Rajeswaran et al., 2017, Hester et al., 2018, Nair et al., 2018, 2020, Vecerik et al., 2017]. A common challenge in
offline RL is robustness to low-quality offline datasets. The works above mostly focus on expert demonstrations
and provide no rigorous guarantees of such robustness. Indeed, Nair et al. [2020] showed that performance does
degrade in practice with low-quality offline data, and in our experiments we observe that DQfD
[Hester et al., 2018] exhibits a similar degradation. On the other hand, our algorithm is robust to the quality of the
offline data. Note that the core idea of our algorithm is similar to that of Vecerik et al. [2017], who adapt DDPG
to the setting of combining RL with expert demonstrations for continuous control. Although Vecerik et al. [2017]
do not provide any theoretical results, it may be possible to combine our theoretical insights with existing analyses
of policy gradient methods to establish guarantees for their algorithm in the hybrid
RL setting. We also include a detailed comparison with previous empirical work in Appendix D.
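To illustrate the mechanism these hybrid approaches share at a high level, the sketch below mixes offline transitions and freshly collected online transitions in each training minibatch before a Q-update is applied; the 50/50 split, the function name, and its arguments are illustrative and do not reproduce the exact sampling schemes of DQfD, Vecerik et al. [2017], or our algorithm.

```python
import random

def sample_hybrid_batch(offline_data, online_buffer, batch_size=256,
                        offline_fraction=0.5):
    """Draw a minibatch mixing offline transitions with online replay data.

    The offline data supplies coverage of a reasonably good policy, while the
    online data reflects the states the current learner actually visits; a
    Q-learning / FQI-style update is then performed on the combined batch.
    """
    n_offline = int(batch_size * offline_fraction)  # illustrative fixed split
    n_online = batch_size - n_offline
    batch = random.sample(offline_data, min(n_offline, len(offline_data)))
    batch += random.sample(online_buffer, min(n_online, len(online_buffer)))
    random.shuffle(batch)
    return batch
```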