a policy trained using the same algorithm [1], [12]. In actor-
critic methods, this is due to extrapolation error of the critic
network on out-of-distribution state-action pairs [14]. Offline
RL methods address this by constraining the learned policy to stay close to the behavior policy that collected the offline dataset.
BRAC [22] achieves this by minimizing the Kullback-Leibler
divergence between the behavior policy and the learned policy.
BEAR [12] minimizes the maximum mean discrepancy (MMD) between the two policies. TD3+BC [4] proposes a simple yet effective offline RL algorithm by adding a behavior cloning loss to the actor update (see the sketch at the end of this paragraph). Another class of offline RL methods learns conservative Q functions, which prevents the policy network from exploiting out-of-distribution actions and forces it
to stay close to the behavior policy. CQL [2] changes the
critic objective to also minimize the Q function on unseen
actions. Fisher-BRC [3] achieves conservative Q learning by
constraining the gradient of the Q function on unseen data.
Model-based offline RL methods [23], [24] train policies based
on the data generated by ensembles of dynamics models
learned from offline data, while constraining the policy to
stay within samples where the dynamics model is certain. In
this paper, we focus on offline-to-online RL with the goal
of stable and sample-efficient online fine-tuning from policies
pre-trained on offline datasets of different quality.
Offline pre-training in RL. Pre-training has been extensively studied in the machine learning community, from computer vision [25]–[27] to natural language processing [28],
[29]. Offline pre-training in RL could enable deployment
of RL methods in domains where data collection can be
expensive or dangerous. [30]–[32] pre-train the policy network
with imitation learning to speed up RL. QT-Opt [33] studies vision-based object manipulation using a large and diverse dataset collected by seven robots over several months and fine-tunes the policy with 27K samples of online data. However, these methods pre-train on large, diverse, or expert datasets, and it is also important to investigate pre-training from offline datasets of different quality. [34], [35]
use offline pre-training to accelerate downstream tasks. AWAC
[7] and Balanced Replay [36] are recent works that also
focus on offline-to-online RL from datasets of different quality.
AWAC updates the policy network so that it is constrained during offline training without being overly conservative during fine-tuning. Balanced Replay trains an additional neural network
to prioritize samples in order to effectively use new data
as well as near-on-policy samples in the offline dataset. We compare against AWAC and Balanced Replay and attain state-of-the-art offline-to-online RL performance on the popular D4RL benchmark.
Ensembles in RL. Ensemble methods are widely used for
better performance in RL [37]–[40]. In model-based RL, PETS
[39] and MBPO [40] use probabilistic ensembles to effectively
model the dynamics of the environment. In model-free RL,
ensembles of Q functions have been shown to improve performance [41], [42]. REDQ [9] learns a randomized ensemble of Q functions to achieve sample efficiency comparable to that of model-based methods without learning a dynamics model (see the sketch at the end of this paragraph). We utilize REDQ in this work for improved sample efficiency during online fine-tuning. Specific to offline RL, REM [11] uses
random convex combinations of multiple Q-value estimates
to calculate the Q targets for effective offline RL on Atari
games. MOPO [23] uses probabilistic ensembles from PETS
to learn policies from offline data using uncertainty estimates
based on model disagreement. MBOP [43] uses ensembles
of dynamics models, Q functions, and policy networks to improve performance on locomotion tasks. Balanced Replay
[36] uses ensembles of pessimistic Q functions to mitigate
instability caused by distribution shift in offline-to-online RL.
While ensembling of Q functions has been studied in several prior works [9], [42], we combine it with a behavioral cloning loss for robust and sample-efficient offline-to-online RL.
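To illustrate the randomized ensembling used by REDQ, the sketch below computes a Q target from the minimum over a random subset of an ensemble of target critics; the ensemble interface, the subset size, and the handling of terminal transitions are assumptions for illustration and omit other ingredients of REDQ such as its high update-to-data ratio.

import random
import torch

def redq_q_target(target_critics, reward, next_state, next_action, done,
                  gamma=0.99, subset_size=2):
    # Randomly pick a small subset of the target-critic ensemble and take the
    # elementwise minimum of their predictions to curb overestimation.
    subset = random.sample(target_critics, subset_size)
    q_values = torch.stack([q(next_state, next_action) for q in subset], dim=0)
    q_min = q_values.min(dim=0).values
    # One-step Bellman target; every critic in the ensemble regresses toward it.
    return reward + gamma * (1.0 - done) * q_min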
Adaptive balancing of multiple objectives in RL. The authors of [44] train policies using learned dynamics models, in an active online learning setting, with the objective of visiting states that are most likely to lead to subsequent improvement of the dynamics model. They adaptively weigh the maximization of cumulative rewards and the minimization of model uncertainty using an online learning mechanism based on the exponential weights algorithm. In this paper, we focus on offline-to-online RL
using model-free methods and propose to adaptively weigh the
maximization of cumulative rewards and a behavioral cloning
loss. Exploring other online learning algorithms, such as the exponential weights algorithm, is left for future work.
III. BACKGROUND
A. Reinforcement Learning
Reinforcement learning (RL) deals with sequential decision
making to maximize cumulative rewards. RL problems are
often formalized as Markov decision processes (MDPs). An
MDP consists of a set of states $S$, a set of actions $A$, transition dynamics $s_{t+1} \sim p(\cdot \mid s_t, a_t)$ representing the probability of transitioning to state $s_{t+1}$ by taking action $a_t$ in state $s_t$ at timestep $t$, a scalar reward function $r_t = R(s_t, a_t)$, and a discount factor $\gamma \in [0, 1]$.
A policy function $\pi$ of an RL agent is a mapping from states to actions and defines the behavior of the agent. The value function $V^\pi(s)$ of a policy $\pi$ is defined as the expected cumulative rewards from state $s$: $V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s\right]$, where the expectation is taken over state transitions $s_{t+1} \sim p(\cdot \mid s_t, a_t)$ and the policy $a_t \sim \pi(s_t)$. Similarly, the state-action value function $Q^\pi(s, a)$ is defined as the expected cumulative rewards after taking action $a$ in state $s$: $Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a\right]$. The goal of RL is to learn an optimal policy $\pi_\theta$, with parameters $\theta$, that maximizes the expected cumulative rewards:
$$\pi_\theta = \arg\max_\theta \; \mathbb{E}_{s \sim S}\left[ Q^{\pi_\theta}\big(s, \pi_\theta(s)\big) \right].$$
We use the TD3 algorithm [8] for reinforcement learning. TD3 is an actor-critic method that alternately trains: (i) the critic network $Q_\phi$ to estimate the values $Q^{\pi_\theta}(s, a)$ of the policy network $\pi_\theta$, and (ii) the policy network to produce actions that maximize the Q function, following the gradient $\nabla_\theta Q_\phi(s, \pi_\theta(s))$.
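As a minimal sketch of these alternating updates, assuming PyTorch-style networks and optimizers (clipped double-Q with twin critics, target-policy smoothing noise, delayed actor updates, and target-network updates from the full TD3 algorithm are omitted for brevity):

import torch
import torch.nn.functional as F

def td3_step(actor, actor_target, critic, critic_target, batch,
             actor_opt, critic_opt, gamma=0.99):
    state, action, reward, next_state, done = batch
    # (i) Critic update: regress Q_phi toward the one-step Bellman target.
    with torch.no_grad():
        next_action = actor_target(next_state)  # full TD3 also adds clipped noise here
        target = reward + gamma * (1.0 - done) * critic_target(next_state, next_action)
    critic_loss = F.mse_loss(critic(state, action), target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    # (ii) Actor update: ascend the gradient of Q_phi(s, pi_theta(s)).
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()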