Hybrid RL: Using Both Offline and Online Data Can Make RL Efficient
Yuda Song*, Yifei Zhou†, Ayush Sekhari‡, J. Andrew Bagnell§, Akshay Krishnamurthy¶, Wen Sun||
March 14, 2023
Abstract
We consider a hybrid reinforcement learning setting (Hybrid RL), in which an agent has access to an offline
dataset and the ability to collect experience via real-world online interaction. The framework mitigates the
challenges that arise in both pure offline and online RL settings, allowing for the design of simple and highly
effective algorithms, in both theory and practice. We demonstrate these advantages by adapting the classical Q-learning/iteration algorithm to the hybrid setting, which we call Hybrid Q-Learning or Hy-Q. In our theoretical
results, we prove that the algorithm is both computationally and statistically efficient whenever the offline dataset
supports a high-quality policy and the environment has bounded bilinear rank. Notably, we require no assumptions
on the coverage provided by the initial distribution, in contrast with guarantees for policy gradient/iteration
methods. In our experimental results, we show that Hy-Q with neural network function approximation outperforms
state-of-the-art online, offline, and hybrid RL baselines on challenging benchmarks, including Montezuma’s
Revenge.
1 Introduction
Learning by interacting with an environment, in the standard online reinforcement learning (RL) protocol, has led to
impressive results across a number of domains. State-of-the-art RL algorithms are quite general, employing function
approximation to scale to complex environments with minimal domain expertise and inductive bias. However, online
RL agents are also notoriously sample inefficient, often requiring billions of environment interactions to achieve
suitable performance. This issue is particularly salient when the environment requires sophisticated exploration
and a high quality reset distribution is unavailable to help overcome the exploration challenge. As a consequence,
the practical success of online RL and related policy gradient/improvement methods has been largely restricted to
settings where a high quality simulator is available.
To overcome the issue of sample inefficiency, attention has turned to the offline RL setting [Levine et al.,2020],
where, rather than interacting with the environment, the agent trains on a large dataset of experience collected in
some other manner (e.g., by a system running in production or an expert). While these methods still require a large
dataset, they mitigate the sample complexity concerns of online RL, since the dataset can be collected without
compromising system performance. However, offline RL methods can suffer from distribution shift, where the
state distribution induced by the learned policy differs significantly from the offline distribution [Wang et al.,2021].
Existing provable approaches for addressing distribution shift are computationally intractable, while empirical
approaches rely on heuristics that can be sensitive to the domain and offline dataset (as we will see).
*Carnegie Mellon University. Email: yudas@cs.cmu.edu
†Cornell University. Email: yz639@cornell.edu. The first two authors have equal contributions to the paper.
‡MIT. Email: sekhari@mit.edu
§Carnegie Mellon University. Email: dbagnell@aurora.tech
¶Microsoft Research. Email: akshaykr@microsoft.com
||Cornell University. Email: ws455@cornell.edu
arXiv:2210.06718v3 [cs.LG] 11 Mar 2023
[Figure 1 plot: per-episode reward vs. number of frames (0 to 200m) on Montezuma's Revenge, with curves for Expert, RND, Hy-Q (easy), Hy-Q (medium), and Hy-Q (hard).]
Figure 1: Performance of our approach Hy-Q on Montezuma's Revenge. We consider three types of offline datasets: easy, medium, and hard. The easy one contains offline data from a high-quality policy (denoted by Expert in the figure), the medium one consists of 20% data from a random policy and 80% from the expert, and the hard one consists of half random data and half expert data. All offline datasets have 100k tuples of state, action, reward, and next state (no trajectory-level information is available to the learners). With only 0.1m offline samples, our approach Hy-Q learns nearly 10x faster than Random Network Distillation (RND), an online deep RL baseline designed for Montezuma's Revenge (see Figure 6 for a visual comparison using screenshots of the game). We also note that Conservative Q-Learning (CQL), an offline RL baseline, completely fails on all three offline datasets, indicating the hardness of offline policy learning from such small offline datasets. The imitation learning baseline Behavior Cloning (BC) achieves reasonable performance on the easy and medium datasets, but fails completely on the hard dataset. See Section 6.2 for a more detailed comparison to CQL, BC, and other baselines.
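For concreteness, the sketch below shows one way such easy/medium/hard mixtures of (s, a, r, s') tuples could be assembled from expert and random transition pools; the `expert_tuples` and `random_tuples` collections, the helper name, and the sampling details are illustrative assumptions, not the exact pipeline used in the experiments.

```python
import numpy as np

def mix_offline_dataset(expert_tuples, random_tuples, expert_frac, size=100_000, seed=0):
    """Assemble an offline dataset of (s, a, r, s') tuples by mixing expert and random
    transitions at a given ratio; no trajectory structure is retained."""
    rng = np.random.default_rng(seed)
    n_expert = int(size * expert_frac)
    n_random = size - n_expert
    # Assumes each pool holds at least as many transitions as we subsample from it.
    expert_idx = rng.choice(len(expert_tuples), size=n_expert, replace=False)
    random_idx = rng.choice(len(random_tuples), size=n_random, replace=False)
    mixed = [expert_tuples[i] for i in expert_idx] + [random_tuples[i] for i in random_idx]
    return [mixed[i] for i in rng.permutation(len(mixed))]

# Illustrative mixtures matching the splits described in the caption:
# easy   = mix_offline_dataset(expert_tuples, random_tuples, expert_frac=1.0)
# medium = mix_offline_dataset(expert_tuples, random_tuples, expert_frac=0.8)
# hard   = mix_offline_dataset(expert_tuples, random_tuples, expert_frac=0.5)
```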
In this paper, we focus on a hybrid reinforcement learning setting, which we call Hybrid RL, that draws on the
favorable properties of both offline and online settings. In Hybrid RL, the agent has both an offline dataset and the
ability to interact with the environment, as in the traditional online RL setting. The offline dataset helps address the
exploration challenge, allowing us to greatly reduce the number of interactions required. Simultaneously, we can
identify and correct distribution shift issues via online interaction. Variants of the setting have been studied in a
number of empirical works [Rajeswaran et al.,2017,Hester et al.,2018,Nair et al.,2018,2020,Vecerik et al.,2017]
which mainly focus on using expert demonstrations as offline data.
Hybrid RL is closely related to the reset setting, where the agent can interact with the environment starting from
a “nice” distribution. A number of simple and effective algorithms, including CPI [Kakade and Langford, 2002], PSDP [Bagnell et al., 2003], and policy gradient methods [Kakade, 2001, Agarwal et al., 2020b] (methods which have inspired recent powerful heuristic RL methods such as TRPO [Schulman et al., 2015] and PPO [Schulman et al., 2017]) are provably efficient in the reset setting. Yet, leveraging a reset distribution is a strong requirement (often
tantamount to having access to a detailed simulation) and unlikely to be available in real world applications. Hybrid
RL differs from the reset setting in that (a) we have an offline dataset, but (b) our online interactions must come from
traces that start at the initial state distribution of the environment, and the initial state distribution is not assumed
to have any nice properties. Both features (offline data and a nice reset distribution) facilitate algorithm design by
de-emphasizing the exploration challenge. However, Hybrid RL is more practical since an offline dataset is much
easier to access, while a nice reset distribution or even generative model is generally not guaranteed in practice.
We showcase the Hybrid RL setting with a new algorithm, Hybrid Q-Learning or Hy-Q (pronounced: Haiku). The algorithm is a simple adaptation of the classical fitted Q-iteration algorithm (FQI) and it accommodates value-based function approximation in a relatively general setup.¹ For our theoretical results, we prove that Hy-Q is both
statistically and computationally efficient assuming that: (1) the offline distribution covers some high quality policy,
(2) the MDP has low bilinear rank, (3) the function approximator is Bellman complete, and (4) we have a least
squares regression oracle. The first three assumptions are standard statistical assumptions in the RL literature
while the fourth is a widely used computational abstraction for supervised learning. No computationally efficient
algorithms are known under these assumptions in pure offline or pure online settings, which highlights the advantages
of the hybrid setting.
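To make the fourth assumption concrete, here is a minimal sketch of how a least squares regression oracle can be instantiated with linear function approximation (a closed-form ridge solve over a fixed feature map); the helper name, the feature map `phi`, and the regularization level are illustrative assumptions.

```python
import numpy as np

def least_squares_oracle(phi, data, targets, reg=1e-5):
    """Least squares regression oracle for a linear class f(s, a) = w . phi(s, a).

    phi     -- callable mapping (s, a) to a length-d feature vector (np.ndarray)
    data    -- list of (s, a) pairs
    targets -- regression targets, one per (s, a) pair
    Returns the weight vector minimizing the ridge-regularized squared error.
    """
    X = np.stack([phi(s, a) for s, a in data])      # (n, d) design matrix
    y = np.asarray(targets, dtype=float)            # (n,) target vector
    d = X.shape[1]
    # Closed-form ridge solution: w = (X^T X + reg * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
```

With such an oracle in hand, the regression step in Eq. (1) of Algorithm 1 reduces to a single oracle call per step h.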
We also implement Hy-Q and evaluate it on two challenging RL benchmarks: a rich observation combination
lock [Misra et al.,2020] and Montezuma’s Revenge from the Arcade Learning Environment [Bellemare et al.,2013].
Starting with an offline dataset that contains some transitions from a high quality policy, our approach outperforms:
an online RL baseline with theoretical guarantees, an online deep RL baseline tuned for Montezuma’s Revenge, a
pure offline RL baseline, an imitation learning baseline, and an existing hybrid method. Compared to the online
methods, Hy-Q requires only a small fraction of the online experience, demonstrating its sample efficiency (e.g.,
Figure 1). Compared to the offline and hybrid methods, Hy-Q performs most favorably when the offline dataset
also contains many interactions from low quality policies, demonstrating its robustness. These results reveal the
significant benefits that can be realized by combining offline and online data.
2 Related Works
We discuss related works from four categories: pure online RL, online RL with access to a reset distribution, offline
RL, and prior work in hybrid settings. We note that pure online RL refers to the setting where one can only reset the
system to the initial state distribution of the environment, which is not assumed to provide any form of coverage.
Pure online RL
Beyond tabular settings, many existing statistically efficient RL algorithms are not computationally
tractable, due to the difficulty of implementing optimism. This is true in the linear MDP [Jin et al.,2020] with
large action spaces, the linear Bellman complete model [Zanette et al.,2020,Agarwal et al.,2019], and in the
general function approximation setting [Jiang et al.,2017,Sun et al.,2019,Du et al.,2021,Jin et al.,2021a]. These
computational challenges have inspired results on intractability of aspects of online RL [Dann et al.,2018,Kane
et al., 2022]. On the other hand, many simple exploration-based algorithms like ε-greedy are computationally efficient, but they may not always work well in practice. Recent theoretical works [Dann et al., 2022, Liu and Brunskill, 2018] have explored additional structural assumptions on the underlying dynamics and value function class under which ε-greedy succeeds, but they still do not capture all the relevant practical problems.
There are several online RL algorithms that aim to tackle the computational issue via stronger structural
assumptions and supervised learning-style computational oracles [Misra et al.,2020,Sekhari et al.,2021,Zhang
et al.,2022c,Agarwal et al.,2020a,Uehara et al.,2021,Modi et al.,2021,Zhang et al.,2022a,Qiu et al.,2022].
Compared to these oracle-based methods, our approach operates in the more general “bilinear rank” setting and
relies on a standard supervised learning primitive: least squares regression. Notably, our oracle admits efficient
implementation with linear function approximation, so we obtain an end-to-end computational guarantee; this is not
true for prior oracle-based methods.
There are many deep RL methods for the online setting (e.g., Schulman et al. [2015, 2017], Lillicrap et al. [2016], Haarnoja et al. [2018], Schrittwieser et al. [2020]). Apart from a few exceptions (e.g., Burda et al. [2018], Badia et al. [2020], Guo et al. [2022]), most rely on random exploration (e.g., ε-greedy) and are not capable of strategic exploration. In fact, guarantees for ε-greedy-like algorithms only exist under additional structural assumptions on the underlying problem.

¹We use Q-learning and Q-iteration interchangeably, although they are not strictly speaking the same algorithm. Our theoretical results analyze Q-iteration, but we use an algorithm with an online/mini-batch flavor that is closer to Q-learning for our experiments.
In our experiments, we test our approach on Montezuma's Revenge, and we pick RND [Burda et al., 2018] as a deep RL exploration baseline due to its simplicity and effectiveness on this game.
Online RL with reset distributions
When an exploratory reset distribution is available, a number of statistically
and computationally efficient algorithms are known. The classic algorithms are CPI [Kakade and Langford,2002],
PSDP [Bagnell et al.,2003], Natural Policy Gradient [Kakade,2001,Agarwal et al.,2020b], and POLYTEX [Abbasi-
Yadkori et al.,2019]. Uchendu et al. [2022] recently demonstrated that algorithms like PSDP work well when
equipped with modern neural network function approximators. However, these algorithms (and their analyses)
rely heavily on the reset distribution to mitigate the exploration challenge, and such a reset distribution is typically unavailable in practice unless one also has a simulator and access to its internal states. In contrast, we assume the offline data covers some high-quality policy (it need not be globally exploratory), which helps with exploration, but
we do not require an exploratory reset distribution. This makes the hybrid setting much more practically appealing.
Offline RL
Offline RL methods learn policies solely from a given offline dataset, with no interaction whatsoever.
When the dataset has global coverage, algorithms such as FQI [Munos and Szepesvári, 2008, Chen and Jiang, 2019] or certainty-equivalence model learning [Ross and Bagnell, 2012] can find near-optimal policies in an
oracle-efficient manner, via least squares or model-fitting oracles. However, with only partial coverage, existing
methods either (a) are not computationally efficient due to the difficulty of implementing pessimism both in linear
settings with large action spaces [Jin et al.,2021b,Zhang et al.,2022b,Chang et al.,2021] and general function
approximation settings [Uehara and Sun,2021,Xie et al.,2021a,Jiang and Huang,2020,Chen and Jiang,2022,
Zhan et al.,2022], or (b) require strong representation conditions such as policy-based Bellman completeness [Xie
et al.,2021a,Zanette et al.,2021]. In contrast, in the hybrid setting, we obtain an efficient algorithm under the more
natural condition of completeness w.r.t. the Bellman optimality operator only.
Among the many empirical offline RL methods (e.g., Kumar et al. [2020], Yu et al. [2021], Kostrikov et al.
[2021], Fujimoto and Gu [2021]), we use CQL [Kumar et al.,2020] as a baseline in our experiments, since it has
been shown to work in image-based settings such as Atari games.
Online RL with offline datasets
Ross and Bagnell [2012] developed a model-based algorithm for a similar hybrid
setting. In comparison, our approach is model-free and consequently may be more suitable for high-dimensional
state spaces (e.g., raw-pixel images). Xie et al. [2021b] studied hybrid RL and showed that offline data does not yield
statistical improvements in tabular MDPs. Our work instead focuses on the function approximation setting and
demonstrates computational benefits of hybrid RL.
On the empirical side, several works consider combining offline expert demonstrations with online interaction
[Rajeswaran et al., 2017, Hester et al., 2018, Nair et al., 2018, 2020, Vecerik et al., 2017]. A common challenge in offline RL is robustness to low-quality offline data. Previous works mostly focus on expert demonstrations and have no rigorous guarantees of such robustness. In fact, Nair et al. [2020] showed that such degradation in
performance indeed happens in practice with low-quality offline data. In our experiments, we observe that DQfD
[Hester et al.,2018] also has a similar degradation. On the other hand, our algorithm is robust to the quality of the
offline data. Note that the core idea of our algorithm is similar to that of Vecerik et al. [2017], who adapt DDPG
to the setting of combining RL with expert demonstrations for continuous control. Although Vecerik et al. [2017]
does not provide any theoretical results, it may be possible to combine our theoretical insights with existing analyses
for policy gradient methods to establish some guarantees of the algorithm from Vecerik et al. [2017] for the hybrid
RL setting. We also include a detailed comparison with previous empirical work in Appendix D.
Algorithm 1 Hybrid Q-learning using both offline and online data (Hy-Q)

Require: Value class $\mathcal{F}$, number of iterations $T$, offline dataset $\mathcal{D}^\nu_h$ of size $m_{\mathrm{off}} = T$ for $h \in [H-1]$.
1: Initialize $f^1_h(s, a) = 0$.
2: for $t = 1, \dots, T$ do
3:   Let $\pi^t$ be the greedy policy w.r.t. $f^t$, i.e., $\pi^t_h(s) = \operatorname{argmax}_a f^t_h(s, a)$.
4:   For each $h$, collect $m_{\mathrm{on}} = 1$ online tuple $\mathcal{D}^t_h \sim d^{\pi^t}_h$.   // Online collection
     // FQI using both online and offline data
5:   Set $f^{t+1}_H(s, a) = 0$.
6:   for $h = H-1, \dots, 0$ do
7:     Estimate $f^{t+1}_h$ using least squares regression on the aggregated data $\mathcal{D}^t_h = \mathcal{D}^\nu_h + \sum_{\tau=1}^{t} \mathcal{D}^\tau_h$:
       $f^{t+1}_h \leftarrow \operatorname{argmin}_{f \in \mathcal{F}_h} \hat{\mathbb{E}}_{\mathcal{D}^t_h}\big[\big(f(s, a) - r - \max_{a'} f^{t+1}_{h+1}(s', a')\big)^2\big]$   (1)
8:   end for
9: end for
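To connect the pseudocode to an implementation, the following is a minimal sketch of a single backup of Eq. (1) with a linear function class; the feature map `phi`, the finite `actions` set, and the ridge regularizer are illustrative assumptions rather than part of the algorithm's specification.

```python
import numpy as np

def fqi_backup(phi, actions, dataset, w_next, reg=1e-5):
    """One backup of Eq. (1): fit f_h to the targets r + max_{a'} f_{h+1}(s', a')
    over the aggregated offline + online data for step h, with a linear class
    f(s, a) = w . phi(s, a).

    phi     -- feature map (s, a) -> length-d np.ndarray
    actions -- finite action set to maximize over
    dataset -- iterable of (s, a, r, s_next) tuples (offline data plus all online data so far)
    w_next  -- weight vector parameterizing f_{h+1} (zeros at h = H)
    """
    X, y = [], []
    for s, a, r, s_next in dataset:
        # Regression target: reward plus the greedy value of f_{h+1} at the next state.
        target = r + max(float(w_next @ phi(s_next, ap)) for ap in actions)
        X.append(phi(s, a))
        y.append(target)
    X, y = np.stack(X), np.asarray(y, dtype=float)
    d = X.shape[1]
    # Ridge-regularized least squares solve for the step-h weights.
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
```

Running this backup for h = H-1 down to 0 on the aggregated offline and online data reproduces lines 6-8 of Algorithm 1.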
3 Preliminaries
We consider a finite-horizon Markov Decision Process $\mathcal{M}(\mathcal{S}, \mathcal{A}, H, R, P, d_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $H$ denotes the horizon, stochastic rewards $R(s, a) \in \Delta([0, 1])$ and $P(s, a) \in \Delta(\mathcal{S})$ are the reward and transition distributions at $(s, a)$, and $d_0 \in \Delta(\mathcal{S})$ is the initial distribution. We assume the agent can only reset from $d_0$ (at the beginning of each episode). Since the optimal policy is non-stationary in this setting, we define a policy $\pi := \{\pi_0, \dots, \pi_{H-1}\}$ where $\pi_h : \mathcal{S} \mapsto \Delta(\mathcal{A})$. Given $\pi$, $d^{\pi}_h \in \Delta(\mathcal{S} \times \mathcal{A})$ denotes the state-action occupancy induced by $\pi$ at step $h$. Given $\pi$, we define the state and state-action value functions in the usual manner: $V^{\pi}_h(s) = \mathbb{E}[\sum_{\tau=h}^{H-1} r_\tau \mid \pi, s_h = s]$ and $Q^{\pi}_h(s, a) = \mathbb{E}[\sum_{\tau=h}^{H-1} r_\tau \mid \pi, s_h = s, a_h = a]$. $Q^{\star}$ and $V^{\star}$ denote the optimal value functions. We denote $V^{\pi} = \mathbb{E}_{s_0 \sim d_0} V^{\pi}_0(s_0)$ as the expected total reward of $\pi$. We define the Bellman operator $\mathcal{T}$ such that for any $f : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$,

$$\mathcal{T} f(s, a) = \mathbb{E}[R(s, a)] + \mathbb{E}_{s' \sim P(s, a)} \max_{a'} f(s', a'), \qquad \forall s, a.$$

We assume that for each $h$ we have an offline dataset of $m_{\mathrm{off}}$ samples $(s, a, r, s')$ drawn i.i.d. via $(s, a) \sim \nu_h$, $r \sim R(s, a)$, $s' \sim P(s, a)$. Here $\nu = \{\nu_0, \dots, \nu_{H-1}\}$ denotes the corresponding offline data distributions. For a dataset $\mathcal{D}$, we use $\hat{\mathbb{E}}_{\mathcal{D}}[\cdot]$ to denote a sample average over this dataset. For our theoretical results, we will assume that $\nu$ covers some high-quality policy. Note that covering a high-quality policy does not mean that $\nu$ itself is a distribution of some high-quality policy. For instance, $\nu$ could be a mixture distribution of some high-quality policy and a few low-quality policies, in which case treating $\nu$ as expert demonstrations will fail completely (as we will show in our experiments). We consider the value-based function approximation setting, where we are given a function class $\mathcal{F} = \mathcal{F}_0 \times \cdots \times \mathcal{F}_{H-1}$ with $\mathcal{F}_h \subset \mathcal{S} \times \mathcal{A} \mapsto [0, V_{\max}]$ that we use to approximate the value functions of the underlying MDP. Here $V_{\max}$ denotes the maximum total reward of a trajectory. For ease of notation, we define $f = \{f_0, \dots, f_{H-1}\}$ and define $\pi^f$ to be the greedy policy w.r.t. $f$, which chooses actions as $\pi^f_h(s) = \operatorname{argmax}_a f_h(s, a)$.
4 Hybrid Q-Learning
In this section, we present our algorithm, Hybrid Q-Learning (Hy-Q), in Algorithm 1. Hy-Q takes as inputs an offline dataset $\{\mathcal{D}^\nu_h\}_{h=1}^{H}$ that contains $(s, a, r, s')$ tuples and a Q-function class $\mathcal{F}_h \subset \mathcal{S} \times \mathcal{A} \mapsto [0, H]$, $h = 1, \dots, H$, and outputs a policy that optimizes the given reward function. The algorithm is conceptually simple: it alternates between collecting a small amount of online data with the current greedy policy and running FQI on the union of the offline data and all online data gathered so far (Eq. (1)).
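As noted in footnote 1, our experiments use a mini-batch, Q-learning-flavored variant with neural network function approximation: instead of solving Eq. (1) to completion at every iteration, one takes stochastic gradient steps on the same squared Bellman error over batches that mix offline and online transitions. The PyTorch sketch below is one way such an update could look; the equal offline/online batch composition, the frozen target network, the architecture, and the hyperparameters are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def hyq_minibatch_update(q_net, q_target, optimizer, offline_batch, online_batch):
    """One gradient step on the squared Bellman error of Eq. (1), using a batch that
    mixes offline and online transitions. Each batch is a tuple of tensors
    (states, actions, rewards, next_states), with discrete actions as indices."""
    # Concatenate offline and online transitions into one training batch.
    s, a, r, s_next = (torch.cat([off, on]) for off, on in zip(offline_batch, online_batch))
    with torch.no_grad():
        # Regression target: r + max_{a'} f(s', a'), computed with a frozen target network.
        target = r + q_target(s_next).max(dim=1).values
    # Q-value of the taken action under the current network.
    pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative setup for a small discrete-action problem:
# q_net    = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))
# q_target = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))
# optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
```

In a full training loop, one would alternate such updates with fresh online data collection by the greedy policy, mirroring lines 3-4 of Algorithm 1.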