Adaptive Behavior Cloning Regularization for
Stable Offline-to-Online Reinforcement Learning
Yi Zhao1, Rinu Boney2, Alexander Ilin2, Juho Kannala2, Joni Pajarinen1,3
1- Aalto University - Department of Electrical Engineering and Automation
2- Aalto University - Department of Computer Science - Finland
3- Technical University Darmstadt - Department of Computer Science - Germany
firstname.lastname@aalto.fi
* Equal contribution
Abstract—Offline reinforcement learning, by learning from
a fixed dataset, makes it possible to learn agent behaviors
without interacting with the environment. However, depending
on the quality of the offline dataset, such pre-trained agents
may have limited performance and need to be further fine-tuned online by interacting with the environment. During
online fine-tuning, the performance of the pre-trained agent
may collapse quickly due to the sudden distribution shift from
offline to online data. While constraints enforced by offline
RL methods, such as a behavior cloning loss, prevent this to some extent, these constraints also significantly slow down online
fine-tuning by forcing the agent to stay close to the behavior
policy. We propose to adaptively weigh the behavior cloning
loss during online fine-tuning based on the agent’s performance
and training stability. Moreover, we use a randomized ensemble
of Q functions to further increase the sample efficiency of
online fine-tuning by performing a large number of learning
updates. Experiments show that the proposed method yields
state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark. Code is available at https://github.com/zhaoyi11/adaptive_bc.
Index Terms—offline-to-online RL, fine-tuning in RL
I. INTRODUCTION
Offline or batch reinforcement learning (RL) deals with the
training of RL agents from fixed datasets generated by possibly
unknown behavior policies, without any interactions with the
environment. This is important in problems like robotics,
autonomous driving, and healthcare where data collection can
be expensive or dangerous. Offline RL has been challenging
for model-free RL methods due to extrapolation error, where the Q networks predict unrealistic values when evaluated on out-of-distribution state-action pairs [1]. Recent methods
overcome this issue by constraining the policy to stay close to
the behavior policy that generated the offline data distribution
[1]–[4], and demonstrate even better performance than the
behavior policy on several simulated and real-world tasks [5]–
[7].
However, the performance of pre-trained policies will be
limited by the quality of the offline dataset and it is often
necessary or desirable to fine-tune them by interacting with
the environment. Also, offline-to-online learning reduces the
risks in online interaction as the offline pre-training results in
reasonable policies that could be tested before deployment.
In practice, offline RL methods often fail during online fine-
tuning by interacting with the environment. This offline-to-
online RL setting is challenging due to: (i) the sudden distri-
bution shift from offline data to online data. This could lead
to severe bootstrapping errors that completely distort the pre-trained policy, leading to a sudden performance drop from
the very beginning of online fine-tuning, and (ii) constraints
enforced by offline RL methods on the policy to stay close
to the behavior policy. While these constraints help in dealing with the sudden distribution shift, they significantly slow down
online fine-tuning from newly collected samples.
We propose to adaptively weigh the offline RL constraints, such as the behavior cloning loss, during online fine-tuning. This
could prevent sudden performance collapses due to the distri-
bution shift while also enabling sample-efficient learning from
the newly collected samples. We propose to perform this adaptive weighing according to the agent's performance and the training stability. We start with TD3+BC, a simple offline RL algorithm recently proposed by [4], which combines TD3 [8] with a simple behavior cloning loss weighted by a hyperparameter α. We adaptively adjust this α hyperparameter using a control mechanism similar to a proportional–derivative (PD) controller. The α value is set based on two components: the difference between the moving-average return and the target return (proportional term), and the difference between the current episodic return and the moving-average return (derivative term).
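To make this controller concrete, below is a minimal sketch of one such adaptive weighting rule (a sketch only: the function names, gains, bounds, and moving-average update are illustrative assumptions; the exact rule used in our experiments is in the released code).

```python
# Illustrative PD-style schedule for the behavior cloning weight alpha.
# Gains, bounds, and the moving-average rule below are assumptions.

def update_return_ma(return_ma, episode_return, tau=0.1):
    """Exponential moving average of episodic returns."""
    return (1.0 - tau) * return_ma + tau * episode_return


def adapt_bc_weight(alpha, episode_return, return_ma, target_return,
                    k_p=1e-4, k_d=1e-4, alpha_min=0.0, alpha_max=0.4):
    """Adjust the behavior cloning weight alpha after each episode."""
    # Proportional term: gap between the smoothed return and the target return.
    p_term = return_ma - target_return
    # Derivative term: gap between the latest episode and the smoothed return.
    d_term = episode_return - return_ma
    # When the agent exceeds the target and keeps improving, relax the BC
    # constraint; when it falls behind or its return drops, tighten it again.
    alpha = alpha - k_p * p_term - k_d * d_term
    return max(alpha_min, min(alpha, alpha_max))
```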
We demonstrate that these simple modifications lead to
stable online fine-tuning after offline pre-training on datasets
of different quality. We also use a randomized ensemble
of Q functions [9] to further improve the sample-efficiency.
We attain state-of-the-art online fine-tuning performance on
locomotion tasks from the popular D4RL benchmark.
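For reference, the randomized ensembling borrowed from REDQ [9] forms bootstrap targets from a small random subset of the critic ensemble. A minimal sketch is shown below (names and tensor shapes are assumptions; REDQ additionally relies on a high update-to-data ratio and target networks, which are omitted here).

```python
import random
import torch

def redq_target(target_critics, reward, not_done, next_obs, next_act,
                gamma=0.99, num_sampled=2):
    """Bootstrap target from a random subset of an ensemble of target critics.

    target_critics: list of callables mapping (obs, act) -> Q-value tensor.
    """
    # Sample a small random subset of the ensemble and take the element-wise
    # minimum of their predictions to reduce overestimation.
    idx = random.sample(range(len(target_critics)), num_sampled)
    q_values = torch.stack([target_critics[i](next_obs, next_act) for i in idx])
    min_q = q_values.min(dim=0).values
    return reward + gamma * not_done * min_q
```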
II. RELATED WORK
Offline RL. Offline RL aims to learn a policy from pre-
collected fixed datasets without interacting with the environ-
ment [1], [5], [10]–[15]. Off-policy RL algorithms allow for
reuse of off-policy data [8], [16]–[21] but they typically fail
when trained offline on a fixed dataset, even if it is collected by
a policy trained using the same algorithm [1], [12]. In actor-
critic methods, this is due to extrapolation error of the critic
network on out-of-distribution state-action pairs [14]. Offline
RL methods deal with this by constraining the policy to stay
close to the behavioral policy that collected the offline dataset.
BRAC [22] achieves this by minimizing the Kullback-Leibler
divergence between the behavior policy and the learned policy.
BEAR [12] minimizes the maximum mean discrepancy (MMD) between the two
policies. TD3+BC [4] proposes a simple yet efficient offline
RL algorithm by adding an additional behavior cloning loss to
the actor update. Another class of offline RL methods learns conservative Q functions that prevent the policy network from exploiting out-of-distribution actions and force it to stay close to the behavior policy. CQL [2] changes the
critic objective to also minimize the Q function on unseen
actions. Fisher-BRC [3] achieves conservative Q learning by
constraining the gradient of the Q function on unseen data.
Model-based offline RL methods [23], [24] train policies based
on the data generated by ensembles of dynamics models
learned from offline data, while constraining the policy to stay within regions where the dynamics model is certain. In
this paper, we focus on offline-to-online RL with the goal
of stable and sample-efficient online fine-tuning from policies
pre-trained on offline datasets of different quality.
Offline pre-training in RL. Pre-training has been widely studied in the machine learning community, from computer vision [25]–[27] to natural language processing [28],
[29]. Offline pre-training in RL could enable deployment
of RL methods in domains where data collection can be
expensive or dangerous. [30]–[32] pre-train the policy network
with imitation learning to speed up RL. QT-Opt [33] studies vision-based object manipulation using a large and diverse dataset collected by seven robots over several months, and fine-tunes the policy with 27K samples of online data. However,
these methods pre-train using diverse, large, or expert datasets
and it is also important to investigate the possibility of pre-
training from offline datasets of different quality. [34], [35]
use offline pre-training to accelerate downstream tasks. AWAC
[7] and Balanced Replay [36] are recent works that also
focus on offline-to-online RL from datasets of different quality.
AWAC updates the policy network such that it is constrained
during offline training while not too conservative during fine-
tuning. Balanced Replay trains an additional neural network
to prioritize samples in order to effectively use new data
as well as near-on-policy samples in the offline dataset. We compare against AWAC and Balanced Replay and attain state-of-the-art offline-to-online RL performance on the popular D4RL benchmark.
Ensembles in RL. Ensemble methods are widely used for
better performance in RL [37]–[40]. In model-based RL, PETS
[39] and MBPO [40] use probabilistic ensembles to effectively
model the dynamics of the environment. In model-free RL,
ensembles of Q functions have been shown to improve per-
formance [41], [42]. REDQ [9] learns a randomized ensemble
of Q functions to achieve sample efficiency similar to that of model-based methods without learning a dynamics model. We utilize
REDQ in this work for improved sample-efficiency during
online fine-tuning. Specific to offline RL, REM [11] uses
random convex combinations of multiple Q-value estimates
to calculate the Q targets for effective offline RL on Atari
games. MOPO [23] uses probabilistic ensembles from PETS
to learn policies from offline data using uncertainty estimates
based on model disagreement. MBOP [43] uses ensembles of dynamics models, Q functions, and policy networks to get
better performance on locomotion tasks. Balanced Replay
[36] uses ensembles of pessimistic Q functions to mitigate
instability caused by distribution shift in offline-to-online RL.
While ensembling of Q functions has been studied by several
prior works [9], [42], we combine it with a behavioral cloning
loss for the purpose of robust and sample-efficient offline-to-
online RL.
Adaptive balancing of multiple objectives in RL. [44]
train policies using learned dynamics models, with the objective of visiting states that are most likely to lead to subsequent improvement in the dynamics model, using active online
learning. They adaptively weigh the maximization of cumu-
lative rewards and minimization of model uncertainty using
an online learning mechanism based on the exponential weights algorithm. In this paper, we focus on offline-to-online RL
using model-free methods and propose to adaptively weigh the
maximization of cumulative rewards and a behavioral cloning
loss. Exploring other online learning algorithms, such as the exponential weights algorithm, is left for future work.
III. BACKGROUND
A. Reinforcement Learning
Reinforcement learning (RL) deals with sequential decision
making to maximize cumulative rewards. RL problems are
often formalized as Markov decision processes (MDPs). An
MDP consists of a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, transition dynamics $s_{t+1} \sim p(\cdot \mid s_t, a_t)$ that give the probability of transitioning to state $s_{t+1}$ by taking action $a_t$ in state $s_t$ at timestep $t$, a scalar reward function $r_t = R(s_t, a_t)$, and a discount factor $\gamma \in [0, 1]$.
A policy function $\pi$ of an RL agent is a mapping from states to actions and defines the behavior of the agent. The value function $V^\pi(s)$ of a policy $\pi$ is defined as the expected cumulative rewards from state $s$: $V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s\right]$, where the expectation is taken over the state transitions $s_{t+1} \sim p(\cdot \mid s_t, a_t)$ and the policy $a_t \sim \pi(s_t)$. Similarly, the state-action value function $Q^\pi(s, a)$ is defined as the expected cumulative rewards after taking action $a$ in state $s$: $Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a\right]$. The goal of RL is to learn an optimal policy function $\pi_\theta$ with parameters $\theta$ that maximizes the expected cumulative rewards:
$$\pi_\theta = \arg\max_\theta \, \mathbb{E}_{s \sim \mathcal{S}} \left[ Q^{\pi_\theta}(s, \pi_\theta(s)) \right].$$
We use the TD3 algorithm for reinforcement learning [8]. TD3 is an actor-critic method that alternately trains: (i) the critic network $Q_\phi$ to estimate the $Q^{\pi_\theta}(s, a)$ values of the policy network $\pi_\theta$, and (ii) the policy network to produce actions that maximize the Q function, following the gradient $\nabla_\theta Q_\phi(s, \pi_\theta(s))$.
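As a reference point, the TD3+BC actor objective [4] augments this update with a behavior cloning term; a minimal sketch follows (module and tensor names are assumptions), and the relative weight of the two terms is what our adaptive scheme adjusts during online fine-tuning.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a TD3+BC-style actor loss (assumed module/tensor names).
# A deterministic policy gradient term is combined with a behavior cloning
# term that keeps the policy close to the dataset actions.
def td3_bc_actor_loss(actor, critic, obs, dataset_act, alpha=2.5):
    pi = actor(obs)
    q = critic(obs, pi)
    # As in TD3+BC, scale the Q term so both terms are on comparable scales.
    lmbda = alpha / q.abs().mean().detach()
    return -lmbda * q.mean() + F.mse_loss(pi, dataset_act)
```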