Boosting Offline Reinforcement Learning via Data Rebalancing

Yang Yue 1 2 * Bingyi Kang 1 † Xiao Ma 1 Zhongwen Xu 1 Gao Huang 2 Shuicheng Yan 1

1 Sea AI Lab, 2 Department of Automation, BNRist, Tsinghua University. Correspondence to: Yang Yue <le-y22@mails.tsinghua.edu.cn>. * This work was done when Yang Yue was an intern at Sea AI Lab. † Corresponding Author.

Abstract
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets. To address this problem, existing works mainly focus on designing sophisticated algorithms to explicitly or implicitly constrain the learned policy to be close to the behavior policy. The constraint applies not only to well-performing actions but also to inferior ones, which limits the performance upper bound of the learned policy. Instead of aligning the densities of two distributions, aligning their supports gives a relaxed constraint while still being able to avoid out-of-distribution actions. Therefore, we propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged. More specifically, we construct a better behavior policy by resampling each transition in an old dataset according to its episodic return. We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time. Extensive experiments demonstrate that ReD is effective at boosting offline RL performance and orthogonal to decoupling strategies in long-tailed classification. New state-of-the-art results are achieved on the D4RL benchmark.
1. Introduction
Recent advances in Deep Reinforcement Learning (DRL) have achieved great success in various challenging decision-making applications, such as board games (Schrittwieser et al., 2020) and strategy games (Vinyals et al., 2019). However, DRL naturally works in an online paradigm where agents need to actively interact with environments to collect experience. This hinders DRL from being applied in
real-world scenarios where interactions are prohibitively expensive and dangerous. Offline reinforcement learning attempts to address this problem by learning from previously collected data, which allows utilizing large datasets to train agents (Lange et al., 2012). Vanilla off-policy RL algorithms suffer poor performance in the offline setting due to the distributional shift problem (Fujimoto et al., 2019). Specifically, performing policy evaluation, i.e., updating the value function with Bellman's equations, involves querying the value of out-of-distribution (OOD) state-action pairs, which potentially leads to accumulated extrapolation error. The main class of existing methods alleviates this problem by constraining the learned policy not to deviate far from the behavior policy, i.e., by directly restricting their probability densities. The constraint can be a KL divergence (Jaques et al., 2019; Peng et al., 2019), Wasserstein distance (Wu et al., 2019), maximum mean discrepancy (MMD) (Kumar et al., 2019), or a behavior cloning regularization (Fujimoto & Gu, 2021).
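This family of density constraints can be summarized by a generic constrained objective; the formulation below is a standard one written in our notation, not quoted from any of the cited papers:
\[
    \max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[Q(s, a)\big]
    \quad \text{s.t.} \quad
    D\big(\pi(\cdot \mid s) \,\|\, \beta(\cdot \mid s)\big) \le \epsilon \;\; \forall s,
\]
where $\beta$ is the behavior policy that generated the dataset and $D$ is instantiated as, e.g., KL divergence, Wasserstein distance, or MMD.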
However, such constraints might be too restrictive, as the learned policy is forced to mimic both bad and good actions of the behavior policy. For instance, consider a dataset $\mathcal{D}$ over state space $\mathcal{S}$ and action space $\mathcal{A} = \{a_1, a_2, a_3\}$, collected with behavior policy $\beta$. At one specific state $s$, the policy $\beta$ assigns probability $0.2$ to action $a_1$, $0.8$ to $a_2$, and zero density to $a_3$. However, $a_1$ would lead to a much higher expected return than $a_2$. Minimizing the density distance between the two policies avoids $a_3$, but forces the learned policy to choose $a_2$ over $a_1$, resulting in much worse performance.
Therefore, a more reasonable condition is to constrain the two policy distributions to have the same support of actions, i.e., the learned policy has positive density only on actions that have non-zero probability under the behavior policy (Kumar et al., 2019). In this case, the constraint regularizes the learned policy to sample in-distribution state-action pairs while admitting a higher performance upper bound. We term this support alignment, a more flexible relaxation of behavior regularization. Nevertheless, explicit support alignment is intractable in practice (Kumar et al., 2019), especially for high-dimensional continuous action spaces.
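For concreteness, the support-alignment condition can be written as follows (our notation):
\[
    \operatorname{supp}\big(\pi(\cdot \mid s)\big) \subseteq \operatorname{supp}\big(\beta(\cdot \mid s)\big)
    \;\;\Longleftrightarrow\;\;
    \beta(a \mid s) = 0 \Rightarrow \pi(a \mid s) = 0, \quad \forall s \in \mathcal{S},\, a \in \mathcal{A}.
\]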
We make one important observation: reweighting the data distribution density does not change the support of the data distribution, i.e., zero density is still zero density after reweighting. In the context of offline RL, instead of sampling uniformly from the offline dataset, varying the sampling rate of offline samples to focus on trajectories with higher cumulative returns, i.e., changing the action density, does not change the support and produces a better behavior policy. Matching the learned policy with such a resampled policy therefore approximately performs support alignment.
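This observation can be stated formally (notation ours; the paper makes the argument in prose). If each state-action pair is reweighted by a strictly positive weight that depends only on the episodic return $R(\tau_{s,a})$ of its trajectory, the reweighted behavior policy
\[
    \beta_w(a \mid s) \;\propto\; w\big(R(\tau_{s,a})\big)\, \beta(a \mid s), \qquad w(\cdot) > 0,
\]
satisfies $\operatorname{supp}\big(\beta_w(\cdot \mid s)\big) = \operatorname{supp}\big(\beta(\cdot \mid s)\big)$, so constraining the learned policy toward $\beta_w$ never introduces out-of-distribution actions.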
In this work, we boost offline RL by designing data rebalancing strategies to construct better behavior policies. We first show that existing offline datasets are extremely imbalanced in terms of episodic return (as shown in Fig. 1). In some datasets, most actions lead to a low return, which raises the possibility that current density-based constraints are too restrictive. We thus propose to resample the dataset during training based on episodic return, assigning larger weights to transitions with higher returns. The method is dubbed Return-based Data Rebalance (ReD) and can be implemented with less than 10 lines of code. Without any modification to prior hyperparameters, we find that ReD effectively boosts the performance of various popular offline RL algorithms by a large margin on diverse domains in D4RL (Brockman et al., 2016; Fu et al., 2020). Then, as a minor contribution, we propose a more elaborate implementation of data rebalance, Decoupled ReD (DeReD), inspired by decoupling strategies for rebalanced training in long-tailed classification (Kang et al., 2020). The proposed DeReD combined with IQL achieves state-of-the-art performance on D4RL. The effectiveness of return-based data rebalance may imply that the data dimension is as important as the algorithmic dimension in offline RL. A minimal code sketch of the resampling step is given below.
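As an illustration of how little code this requires, below is a minimal NumPy sketch of the return-based resampling step. The function names and the exact weighting scheme (min-max-normalized episodic return plus a small constant, so that every transition keeps a strictly positive sampling probability) are our assumptions for illustration; the paper only specifies that transitions from higher-return trajectories are sampled more often.

import numpy as np

# Minimal sketch of Return-based Data Rebalance (ReD): replace the uniform
# batch sampler of an offline RL algorithm with return-weighted sampling.
# The weighting below (normalized return + eps) is an illustrative choice.

def red_sampling_probs(episode_returns, eps=1e-3):
    """episode_returns: length-N array giving, for each transition, the
    episodic return of the trajectory it belongs to."""
    r = np.asarray(episode_returns, dtype=np.float64)
    r = (r - r.min()) / (r.max() - r.min() + 1e-8)  # normalize returns to [0, 1]
    w = r + eps                                      # keep all probabilities > 0
    return w / w.sum()                               # valid sampling distribution

def sample_batch(dataset, probs, batch_size=256):
    # dataset: dict of equally sized arrays (observations, actions, rewards, ...)
    idx = np.random.choice(len(probs), size=batch_size, p=probs)
    return {k: v[idx] for k, v in dataset.items()}

Because the weights are strictly positive, the resampled dataset has the same support as the original one; only the density over transitions changes.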
Figure 1. Visualization of trajectory return distributions (panels: hopper-medium-replay-v2, walker2d-medium-replay-v2, hopper-medium-expert-v2, walker2d-medium-expert-v2). Medium-replay datasets are likely to have a long-tailed distribution, and medium-expert datasets are likely to have two peaks.
2. Related Work
Offline RL. To alleviate extrapolation error and address the distributional shift problem, a general framework in prior offline RL work is to constrain the learned policy to stay close to the behavior policy. Since KL divergence is easy to compute accurately under a Gaussian distribution assumption, many works choose KL divergence as the policy constraint. There are many concrete implementation choices, e.g., explicitly modeling the behavior prior with a VAE, or avoiding explicit modeling via the dual form (Wu et al., 2019; Jaques et al., 2019). Exponentially advantage-weighted regression, an implicit form of the KL-divergence constraint, is derived by AWR (Peng et al., 2019), CRR (Wang et al., 2020), and AWAC (Nair et al., 2020). IQL (Kostrikov et al., 2021b) also extracts the policy via advantage-weighted regression from an expectile value function, enforcing a KL constraint. Behavior cloning (BC) is another way to implement the constraint (Fujimoto & Gu, 2021). Other directions exist as well. One line of work regularizes the Q-function with conservative estimates (Kumar et al., 2020; Buckman et al., 2020); surprisingly, our experiments show that data rebalance also works with conservative Q-learning (Kumar et al., 2020). Another line views offline RL as a sequence modeling problem with a masked transformer (Chen et al., 2021; Janner et al., 2021), where the transformer outputs actions to attain a given return.
Some works attempt to approximately satisfy support alignment. BEAR (Kumar et al., 2019) utilizes maximum mean discrepancy (MMD) to approximately optimize for support alignment. However, the ability of the sampled MMD to constrain two distributions to the same support has only been shown empirically on low-dimensional distributions with diagonal covariance matrices, with no theoretical guarantee, and MMD is extremely complex to implement. That is a possible reason why Wu et al. (2019) find that MMD offers no gain over KL. Another attempt to relax the restrictive constraint is to adaptively adjust the weight of the constraint term via dual gradient ascent. However, Wu et al. (2019) observe that the adaptive weight performs slightly worse than a fixed one.
Data Rebalance. Dataset rebalancing is widely used in visual tasks that face a long-tailed distribution (Zhang et al., 2021). For decision making, imitation learning (IL) aims to learn from demonstrations, where rebalancing is naturally applied to filter out bad demonstrations. BAIL (Chen et al., 2020) employs a neural network to approximate the upper envelope (i.e., the optimal return attainable from the data) and selects good state-action pairs to imitate. Another form of rebalancing in IL is 10%BC, where behavior cloning uses only the top 10% of transitions ordered by episodic return (Chen et al., 2021); a sketch of this filter is given after this paragraph. MARWIL (Wang et al., 2018) and AWR (Peng et al., 2019) employ exponentially advantage-weighted behavior cloning, which is equivalent to a policy improvement step with a KL constraint in RL. Our experiments show that our method can further improve such KL-constraint methods.
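For illustration, a top-fraction-by-return filter of the kind used by 10%BC can be sketched as follows; the helper name, the per-transition return array, and the quantile-based cutoff are our assumptions rather than details from the cited work.

import numpy as np

# Sketch of a 10%BC-style filter: keep only transitions whose episode return
# falls in the top `frac` fraction, then run ordinary behavior cloning on them.

def top_return_mask(episode_returns, frac=0.1):
    """episode_returns: per-transition array of the episodic return of the
    episode each transition belongs to. Returns a boolean keep-mask."""
    r = np.asarray(episode_returns, dtype=np.float64)
    cutoff = np.quantile(r, 1.0 - frac)  # return threshold for the top `frac`
    return r >= cutoff

# Example: filtered = {k: v[top_return_mask(returns)] for k, v in dataset.items()}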