Hybrid RL: Using Both Offline and Online Data Can Make RL Efficient
Yuda Song*, Yifei Zhou†, Ayush Sekhari‡, J. Andrew Bagnell§, Akshay Krishnamurthy¶, Wen Sun||
March 14, 2023
Abstract
We consider a hybrid reinforcement learning setting (Hybrid RL), in which an agent has access to an offline
dataset and the ability to collect experience via real-world online interaction. The framework mitigates the
challenges that arise in both pure offline and online RL settings, allowing for the design of simple and highly
effective algorithms, in both theory and practice. We demonstrate these advantages by adapting the classical Q-learning/iteration algorithm to the hybrid setting, which we call Hybrid Q-Learning or Hy-Q. In our theoretical
results, we prove that the algorithm is both computationally and statistically efficient whenever the offline dataset
supports a high-quality policy and the environment has bounded bilinear rank. Notably, we require no assumptions
on the coverage provided by the initial distribution, in contrast with guarantees for policy gradient/iteration
methods. In our experimental results, we show that Hy-Q with neural network function approximation outperforms
state-of-the-art online, offline, and hybrid RL baselines on challenging benchmarks, including Montezuma’s
Revenge.
1 Introduction
Learning by interacting with an environment, in the standard online reinforcement learning (RL) protocol, has led to
impressive results across a number of domains. State-of-the-art RL algorithms are quite general, employing function
approximation to scale to complex environments with minimal domain expertise and inductive bias. However, online
RL agents are also notoriously sample inefficient, often requiring billions of environment interactions to achieve
suitable performance. This issue is particularly salient when the environment requires sophisticated exploration
and a high quality reset distribution is unavailable to help overcome the exploration challenge. As a consequence,
the practical success of online RL and related policy gradient/improvement methods has been largely restricted to
settings where a high quality simulator is available.
To overcome the issue of sample inefficiency, attention has turned to the offline RL setting [Levine et al.,2020],
where, rather than interacting with the environment, the agent trains on a large dataset of experience collected in
some other manner (e.g., by a system running in production or an expert). While these methods still require a large
dataset, they mitigate the sample complexity concerns of online RL, since the dataset can be collected without
compromising system performance. However, offline RL methods can suffer from distribution shift, where the
state distribution induced by the learned policy differs significantly from the offline distribution [Wang et al.,2021].
Existing provable approaches for addressing distribution shift are computationally intractable, while empirical
approaches rely on heuristics that can be sensitive to the domain and offline dataset (as we will see).
*Carnegie Mellon University. Email: yudas@cs.cmu.edu
†Cornell University. Email: yz639@cornell.edu. The first two authors have equal contributions to the paper.
‡MIT. Email: sekhari@mit.edu
§Carnegie Mellon University. Email: dbagnell@aurora.tech
¶Microsoft Research. Email: akshaykr@microsoft.com
||Cornell University. Email: ws455@cornell.edu
arXiv:2210.06718v3 [cs.LG] 11 Mar 2023
[Figure 1 plot: per-episode reward vs. number of frames (0 to 200m) on Montezuma's Revenge, with curves for Expert, RND, Hy-Q (easy), Hy-Q (medium), and Hy-Q (hard).]
Figure 1: Performance of our approach Hy-Q on Montezuma's Revenge. We consider three types of offline datasets: easy, medium, and hard. The easy one contains offline data from a high-quality policy (denoted by Expert in the figure), the medium one consists of 20% data from a random policy and 80% from the expert, and the hard one consists of half random data and half expert data. All offline datasets have 100k tuples of state, action, reward, and next state (no trajectory-level information is available to the learners). With only 0.1m offline samples, our approach Hy-Q learns nearly 10x faster than Random Network Distillation (RND), an online deep RL baseline designed for Montezuma's Revenge (see Figure 6 for a visual comparison using screenshots of the game). We also note that Conservative Q-Learning (CQL), an offline RL baseline, completely fails on all three offline datasets, indicating the hardness of offline policy learning from such small offline datasets. The imitation learning baseline Behavior Cloning (BC) achieves reasonable performance on the easy and medium datasets, but fails completely on the hard dataset. See Section 6.2 for a more detailed comparison to CQL, BC, and other baselines.
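For concreteness, the sketch below shows one way such easy/medium/hard mixtures of (s, a, r, s') tuples could be assembled from expert and random transition pools; the `expert_tuples` and `random_tuples` collections, the helper name, and the sampling details are illustrative assumptions, not the exact pipeline used in the experiments.

```python
import numpy as np

def mix_offline_dataset(expert_tuples, random_tuples, expert_frac, size=100_000, seed=0):
    """Assemble an offline dataset of (s, a, r, s') tuples by mixing expert and random
    transitions at a given ratio; no trajectory structure is retained."""
    rng = np.random.default_rng(seed)
    n_expert = int(size * expert_frac)
    n_random = size - n_expert
    # Assumes each pool holds at least as many transitions as we subsample from it.
    expert_idx = rng.choice(len(expert_tuples), size=n_expert, replace=False)
    random_idx = rng.choice(len(random_tuples), size=n_random, replace=False)
    mixed = [expert_tuples[i] for i in expert_idx] + [random_tuples[i] for i in random_idx]
    return [mixed[i] for i in rng.permutation(len(mixed))]

# Illustrative mixtures matching the splits described in the caption:
# easy   = mix_offline_dataset(expert_tuples, random_tuples, expert_frac=1.0)
# medium = mix_offline_dataset(expert_tuples, random_tuples, expert_frac=0.8)
# hard   = mix_offline_dataset(expert_tuples, random_tuples, expert_frac=0.5)
```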
In this paper, we focus on a hybrid reinforcement learning setting, which we call Hybrid RL, that draws on the
favorable properties of both offline and online settings. In Hybrid RL, the agent has both an offline dataset and the
ability to interact with the environment, as in the traditional online RL setting. The offline dataset helps address the
exploration challenge, allowing us to greatly reduce the number of interactions required. Simultaneously, we can
identify and correct distribution shift issues via online interaction. Variants of the setting have been studied in a
number of empirical works [Rajeswaran et al.,2017,Hester et al.,2018,Nair et al.,2018,2020,Vecerik et al.,2017]
which mainly focus on using expert demonstrations as offline data.
Hybrid RL is closely related to the reset setting, where the agent can interact with the environment starting from
a “nice” distribution. A number of simple and effective algorithms, including CPI [Kakade and Langford, 2002], PSDP [Bagnell et al., 2003], and policy gradient methods [Kakade, 2001, Agarwal et al., 2020b] (methods which have inspired recent powerful heuristic RL methods such as TRPO [Schulman et al., 2015] and PPO [Schulman et al., 2017]) are provably efficient in the reset setting. Yet, leveraging a reset distribution is a strong requirement (often
tantamount to having access to a detailed simulation) and unlikely to be available in real world applications. Hybrid
RL differs from the reset setting in that (a) we have an offline dataset, but (b) our online interactions must come from
traces that start at the initial state distribution of the environment, and the initial state distribution is not assumed
to have any nice properties. Both features (offline data and a nice reset distribution) facilitate algorithm design by
de-emphasizing the exploration challenge. However, Hybrid RL is more practical since an offline dataset is much
easier to access, while a nice reset distribution or even generative model is generally not guaranteed in practice.
We showcase the Hybrid RL setting with a new algorithm, Hybrid Q-Learning or Hy-Q (pronounced: Haiku). The algorithm is a simple adaptation of the classical fitted Q-iteration algorithm (FQI) and it accommodates value-based function approximation in a relatively general setup.¹ For our theoretical results, we prove that Hy-Q is both
statistically and computationally efficient assuming that: (1) the offline distribution covers some high quality policy,
(2) the MDP has low bilinear rank, (3) the function approximator is Bellman complete, and (4) we have a least
squares regression oracle. The first three assumptions are standard statistical assumptions in the RL literature
while the fourth is a widely used computational abstraction for supervised learning. No computationally efficient
algorithms are known under these assumptions in pure offline or pure online settings, which highlights the advantages
of the hybrid setting.
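To make the fourth assumption concrete, here is a minimal sketch of how a least squares regression oracle can be instantiated with linear function approximation (a closed-form ridge solve over a fixed feature map); the helper name, the feature map `phi`, and the regularization level are illustrative assumptions.

```python
import numpy as np

def least_squares_oracle(phi, data, targets, reg=1e-5):
    """Least squares regression oracle for a linear class f(s, a) = w . phi(s, a).

    phi     -- callable mapping (s, a) to a length-d feature vector (np.ndarray)
    data    -- list of (s, a) pairs
    targets -- regression targets, one per (s, a) pair
    Returns the weight vector minimizing the ridge-regularized squared error.
    """
    X = np.stack([phi(s, a) for s, a in data])      # (n, d) design matrix
    y = np.asarray(targets, dtype=float)            # (n,) target vector
    d = X.shape[1]
    # Closed-form ridge solution: w = (X^T X + reg * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
```

With such an oracle in hand, the regression step in Eq. (1) of Algorithm 1 reduces to a single oracle call per step h.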
We also implement Hy-Q and evaluate it on two challenging RL benchmarks: a rich observation combination
lock [Misra et al.,2020] and Montezuma’s Revenge from the Arcade Learning Environment [Bellemare et al.,2013].
Starting with an offline dataset that contains some transitions from a high quality policy, our approach outperforms:
an online RL baseline with theoretical guarantees, an online deep RL baseline tuned for Montezuma’s Revenge, a
pure offline RL baseline, an imitation learning baseline, and an existing hybrid method. Compared to the online
methods, Hy-Q requires only a small fraction of the online experience, demonstrating its sample efficiency (e.g.,
Figure 1). Compared to the offline and hybrid methods, Hy-Q performs most favorably when the offline dataset
also contains many interactions from low quality policies, demonstrating its robustness. These results reveal the
significant benefits that can be realized by combining offline and online data.
2 Related Works
We discuss related works from four categories: pure online RL, online RL with access to a reset distribution, offline
RL, and prior work in hybrid settings. We note that pure online RL refers to the setting where one can only reset the
system to the initial state distribution of the environment, which is not assumed to provide any form of coverage.
Pure online RL
Beyond tabular settings, many existing statistically efficient RL algorithms are not computationally
tractable, due to the difficulty of implementing optimism. This is true in the linear MDP [Jin et al.,2020] with
large action spaces, the linear Bellman complete model [Zanette et al.,2020,Agarwal et al.,2019], and in the
general function approximation setting [Jiang et al.,2017,Sun et al.,2019,Du et al.,2021,Jin et al.,2021a]. These
computational challenges have inspired results on intractability of aspects of online RL [Dann et al.,2018,Kane
et al., 2022]. On the other hand, many simple exploration-based algorithms like ε-greedy are computationally efficient, but they may not always work well in practice. Recent theoretical works [Dann et al., 2022, Liu and Brunskill, 2018] have explored additional structural assumptions on the underlying dynamics and value function class under which ε-greedy succeeds, but they still do not capture all the relevant practical problems.
There are several online RL algorithms that aim to tackle the computational issue via stronger structural
assumptions and supervised learning-style computational oracles [Misra et al.,2020,Sekhari et al.,2021,Zhang
et al.,2022c,Agarwal et al.,2020a,Uehara et al.,2021,Modi et al.,2021,Zhang et al.,2022a,Qiu et al.,2022].
Compared to these oracle-based methods, our approach operates in the more general “bilinear rank” setting and
relies on a standard supervised learning primitive: least squares regression. Notably, our oracle admits efficient
implementation with linear function approximation, so we obtain an end-to-end computational guarantee; this is not
true for prior oracle-based methods.
There are many deep RL methods for the online setting (e.g., Schulman et al. [2015, 2017], Lillicrap et al. [2016], Haarnoja et al. [2018], Schrittwieser et al. [2020]). Apart from a few exceptions (e.g., Burda et al. [2018], Badia et al. [2020], Guo et al. [2022]), most rely on random exploration (e.g., ε-greedy) and are not capable of strategic exploration. In fact, guarantees for ε-greedy-like algorithms only exist under additional structural assumptions on the underlying problem.

¹We use Q-learning and Q-iteration interchangeably, although they are not strictly speaking the same algorithm. Our theoretical results analyze Q-iteration, but we use an algorithm with an online/mini-batch flavor that is closer to Q-learning for our experiments.
In our experiments, we test our approach on Montezuma's Revenge, and we pick RND [Burda et al., 2018] as a deep RL exploration baseline due to its simplicity and effectiveness on this game.
Online RL with reset distributions
When an exploratory reset distribution is available, a number of statistically
and computationally efficient algorithms are known. The classic algorithms are CPI [Kakade and Langford,2002],
PSDP [Bagnell et al.,2003], Natural Policy Gradient [Kakade,2001,Agarwal et al.,2020b], and POLYTEX [Abbasi-
Yadkori et al.,2019]. Uchendu et al. [2022] recently demonstrated that algorithms like PSDP work well when
equipped with modern neural network function approximators. However, these algorithms (and their analyses)
rely heavily on the reset distribution to mitigate the exploration challenge, and such a reset distribution is typically unavailable in practice unless one also has a simulator and access to its internal states. In contrast, we assume the offline data covers some high-quality policy (it need not be globally exploratory), which helps with exploration, but
we do not require an exploratory reset distribution. This makes the hybrid setting much more practically appealing.
Offline RL
Offline RL methods learn policies solely from a given offline dataset, with no interaction whatsoever.
When the dataset has global coverage, algorithms such as FQI [Munos and Szepesvári, 2008, Chen and Jiang, 2019] or certainty-equivalence model learning [Ross and Bagnell, 2012] can find near-optimal policies in an
oracle-efficient manner, via least squares or model-fitting oracles. However, with only partial coverage, existing
methods either (a) are not computationally efficient due to the difficulty of implementing pessimism both in linear
settings with large action spaces [Jin et al.,2021b,Zhang et al.,2022b,Chang et al.,2021] and general function
approximation settings [Uehara and Sun,2021,Xie et al.,2021a,Jiang and Huang,2020,Chen and Jiang,2022,
Zhan et al.,2022], or (b) require strong representation conditions such as policy-based Bellman completeness [Xie
et al.,2021a,Zanette et al.,2021]. In contrast, in the hybrid setting, we obtain an efficient algorithm under the more
natural condition of completeness w.r.t. the Bellman optimality operator only.
Among the many empirical offline RL methods (e.g., Kumar et al. [2020], Yu et al. [2021], Kostrikov et al.
[2021], Fujimoto and Gu [2021]), we use CQL [Kumar et al.,2020] as a baseline in our experiments, since it has
been shown to work in image-based settings such as Atari games.
Online RL with offline datasets
Ross and Bagnell [2012] developed a model-based algorithm for a similar hybrid
setting. In comparison, our approach is model-free and consequently may be more suitable for high-dimensional
state spaces (e.g., raw-pixel images). Xie et al. [2021b] studied hybrid RL and showed that offline data does not yield
statistical improvements in tabular MDPs. Our work instead focuses on the function approximation setting and
demonstrates computational benefits of hybrid RL.
On the empirical side, several works consider combining offline expert demonstrations with online interaction
[Rajeswaran et al., 2017, Hester et al., 2018, Nair et al., 2018, 2020, Vecerik et al., 2017]. A common challenge in offline RL is robustness to low-quality offline data. Previous works mostly focus on expert demonstrations and have no rigorous guarantees of such robustness. In fact, Nair et al. [2020] showed that such degradation in
performance indeed happens in practice with low-quality offline data. In our experiments, we observe that DQfD
[Hester et al.,2018] also has a similar degradation. On the other hand, our algorithm is robust to the quality of the
offline data. Note that the core idea of our algorithm is similar to that of Vecerik et al. [2017], who adapt DDPG
to the setting of combining RL with expert demonstrations for continuous control. Although Vecerik et al. [2017]
does not provide any theoretical results, it may be possible to combine our theoretical insights with existing analyses
for policy gradient methods to establish some guarantees of the algorithm from Vecerik et al. [2017] for the hybrid
RL setting. We also include a detailed comparison with previous empirical work in Appendix D.
Algorithm 1 Hybrid Q-learning using both offline and online data (Hy-Q)

Require: Value class $\mathcal{F}$, number of iterations $T$, offline dataset $\mathcal{D}^\nu_h$ of size $m_{\mathrm{off}} = T$ for $h \in [H-1]$.
1: Initialize $f^1_h(s, a) = 0$.
2: for $t = 1, \dots, T$ do
3:   Let $\pi^t$ be the greedy policy w.r.t. $f^t$, i.e., $\pi^t_h(s) = \operatorname{argmax}_a f^t_h(s, a)$.
4:   For each $h$, collect $m_{\mathrm{on}} = 1$ online tuple $\mathcal{D}^t_h \sim d^{\pi^t}_h$.   // Online collection
     // FQI using both online and offline data
5:   Set $f^{t+1}_H(s, a) = 0$.
6:   for $h = H-1, \dots, 0$ do
7:     Estimate $f^{t+1}_h$ using least squares regression on the aggregated data $\mathcal{D}^t_h = \mathcal{D}^\nu_h + \sum_{\tau=1}^{t} \mathcal{D}^\tau_h$:
       $f^{t+1}_h \leftarrow \operatorname{argmin}_{f \in \mathcal{F}_h} \hat{\mathbb{E}}_{\mathcal{D}^t_h}\big[\big(f(s, a) - r - \max_{a'} f^{t+1}_{h+1}(s', a')\big)^2\big]$   (1)
8:   end for
9: end for
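To connect the pseudocode to an implementation, the following is a minimal sketch of a single backup of Eq. (1) with a linear function class; the feature map `phi`, the finite `actions` set, and the ridge regularizer are illustrative assumptions rather than part of the algorithm's specification.

```python
import numpy as np

def fqi_backup(phi, actions, dataset, w_next, reg=1e-5):
    """One backup of Eq. (1): fit f_h to the targets r + max_{a'} f_{h+1}(s', a')
    over the aggregated offline + online data for step h, with a linear class
    f(s, a) = w . phi(s, a).

    phi     -- feature map (s, a) -> length-d np.ndarray
    actions -- finite action set to maximize over
    dataset -- iterable of (s, a, r, s_next) tuples (offline data plus all online data so far)
    w_next  -- weight vector parameterizing f_{h+1} (zeros at h = H)
    """
    X, y = [], []
    for s, a, r, s_next in dataset:
        # Regression target: reward plus the greedy value of f_{h+1} at the next state.
        target = r + max(float(w_next @ phi(s_next, ap)) for ap in actions)
        X.append(phi(s, a))
        y.append(target)
    X, y = np.stack(X), np.asarray(y, dtype=float)
    d = X.shape[1]
    # Ridge-regularized least squares solve for the step-h weights.
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
```

Running this backup for h = H-1 down to 0 on the aggregated offline and online data reproduces lines 6-8 of Algorithm 1.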
3 Preliminaries
We consider a finite-horizon Markov Decision Process $\mathcal{M}(\mathcal{S}, \mathcal{A}, H, R, P, d_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $H$ denotes the horizon, stochastic rewards $R(s, a) \in \Delta([0, 1])$ and $P(s, a) \in \Delta(\mathcal{S})$ are the reward and transition distributions at $(s, a)$, and $d_0 \in \Delta(\mathcal{S})$ is the initial distribution. We assume the agent can only reset from $d_0$ (at the beginning of each episode). Since the optimal policy is non-stationary in this setting, we define a policy $\pi := \{\pi_0, \dots, \pi_{H-1}\}$ where $\pi_h : \mathcal{S} \mapsto \Delta(\mathcal{A})$. Given $\pi$, $d^{\pi}_h \in \Delta(\mathcal{S} \times \mathcal{A})$ denotes the state-action occupancy induced by $\pi$ at step $h$. Given $\pi$, we define the state and state-action value functions in the usual manner: $V^{\pi}_h(s) = \mathbb{E}[\sum_{\tau=h}^{H-1} r_\tau \mid \pi, s_h = s]$ and $Q^{\pi}_h(s, a) = \mathbb{E}[\sum_{\tau=h}^{H-1} r_\tau \mid \pi, s_h = s, a_h = a]$. $Q^{\star}$ and $V^{\star}$ denote the optimal value functions. We denote $V^{\pi} = \mathbb{E}_{s_0 \sim d_0} V^{\pi}_0(s_0)$ as the expected total reward of $\pi$. We define the Bellman operator $\mathcal{T}$ such that for any $f : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$,

$$\mathcal{T} f(s, a) = \mathbb{E}[R(s, a)] + \mathbb{E}_{s' \sim P(s, a)} \max_{a'} f(s', a'), \qquad \forall s, a.$$

We assume that for each $h$ we have an offline dataset of $m_{\mathrm{off}}$ samples $(s, a, r, s')$ drawn i.i.d. via $(s, a) \sim \nu_h$, $r \sim R(s, a)$, $s' \sim P(s, a)$. Here $\nu = \{\nu_0, \dots, \nu_{H-1}\}$ denotes the corresponding offline data distributions. For a dataset $\mathcal{D}$, we use $\hat{\mathbb{E}}_{\mathcal{D}}[\cdot]$ to denote a sample average over this dataset. For our theoretical results, we will assume that $\nu$ covers some high-quality policy. Note that covering a high-quality policy does not mean that $\nu$ itself is a distribution of some high-quality policy. For instance, $\nu$ could be a mixture distribution of some high-quality policy and a few low-quality policies, in which case treating $\nu$ as expert demonstrations will fail completely (as we will show in our experiments). We consider the value-based function approximation setting, where we are given a function class $\mathcal{F} = \mathcal{F}_0 \times \cdots \times \mathcal{F}_{H-1}$ with $\mathcal{F}_h \subset \mathcal{S} \times \mathcal{A} \mapsto [0, V_{\max}]$ that we use to approximate the value functions of the underlying MDP. Here $V_{\max}$ denotes the maximum total reward of a trajectory. For ease of notation, we define $f = \{f_0, \dots, f_{H-1}\}$ and define $\pi^f$ to be the greedy policy w.r.t. $f$, which chooses actions as $\pi^f_h(s) = \operatorname{argmax}_a f_h(s, a)$.
4 Hybrid Q-Learning
In this section, we present our algorithm, Hybrid Q-Learning (Hy-Q), in Algorithm 1. Hy-Q takes as inputs an offline dataset $\{\mathcal{D}^\nu_h\}_{h=1}^{H}$ that contains $(s, a, r, s')$ tuples and a Q-function class $\mathcal{F}_h \subset \mathcal{S} \times \mathcal{A} \mapsto [0, H]$, $h = 1, \dots, H$, and outputs a policy that optimizes the given reward function. The algorithm is conceptually simple: it alternates between collecting a small amount of online data with the current greedy policy and running FQI on the union of the offline data and all online data gathered so far (Eq. (1)).
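As noted in footnote 1, our experiments use a mini-batch, Q-learning-flavored variant with neural network function approximation: instead of solving Eq. (1) to completion at every iteration, one takes stochastic gradient steps on the same squared Bellman error over batches that mix offline and online transitions. The PyTorch sketch below is one way such an update could look; the equal offline/online batch composition, the frozen target network, the architecture, and the hyperparameters are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def hyq_minibatch_update(q_net, q_target, optimizer, offline_batch, online_batch):
    """One gradient step on the squared Bellman error of Eq. (1), using a batch that
    mixes offline and online transitions. Each batch is a tuple of tensors
    (states, actions, rewards, next_states), with discrete actions as indices."""
    # Concatenate offline and online transitions into one training batch.
    s, a, r, s_next = (torch.cat([off, on]) for off, on in zip(offline_batch, online_batch))
    with torch.no_grad():
        # Regression target: r + max_{a'} f(s', a'), computed with a frozen target network.
        target = r + q_target(s_next).max(dim=1).values
    # Q-value of the taken action under the current network.
    pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative setup for a small discrete-action problem:
# q_net    = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))
# q_target = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))
# optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
```

In a full training loop, one would alternate such updates with fresh online data collection by the greedy policy, mirroring lines 3-4 of Algorithm 1.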